
In [None]:
!pip install transformers



**NLP Tasks**

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I was waiting for this moment")

print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.971838653087616}]
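The `score` in the output above is a probability. For a two-class model like this one, the text-classification pipeline applies a softmax to the model's raw logits and reports the most likely class. Below is a minimal sketch of that last step, assuming made-up logits (`logits` is hypothetical; a real model produces its own values):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, in the order [NEGATIVE, POSITIVE].
logits = [-1.5, 2.0]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
result = [{"label": id2label[best], "score": probs[best]}]
print(result)
```

The result has the same shape as the pipeline output: a list with one dictionary per input sentence.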


In [None]:
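# Note: this checkpoint has no trained classification head (see the warning below),
# so `score.weight` is randomly initialized and the predicted label is not meaningful.
pipeline("sentiment-analysis", model="FutureMa/Qwen3-4B-Evasion")("I was waiting for this moment")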


Some weights of Qwen3ForSequenceClassification were not initialized from the model checkpoint at FutureMa/Qwen3-4B-Evasion and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Invalid model-index. Not loading eval results into CardData.
Device set to use cuda:0


[{'label': 'LABEL_0', 'score': 0.9795114994049072}]

**Text Generation**

In [None]:
text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generated_text = text_generator("Today is rainy day in London", truncation=True, num_return_sequences=2)

print("Generated texts:")
for i, result_dict in enumerate(generated_text):
    print(f"Sequence {i+1}: {result_dict['generated_text']}")

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated texts:
Sequence 1: Today is rainy day in London. The weather has been chilly since the beginning of September and for far too long, the streets and streets are now flooded with rain.



The BBC‏ The BBC has reported that the weather has been dry for more than a week.
The BBC will update this article after the publication of the report.
Sequence 2: Today is rainy day in London. It is often a very cold day.




"People walk to work and they are on the edge of walking down the street. They are walking through this street and then they get to work and they get to work and they get to work and they get to work and they are all just walking by the block or by the road.
"And then they get to work and they aren't walking by the street. So they are walking by other people and they're walking by a bus, walking by a bus. But that's what they're doing and they're walking by the car."
But there is a reason why the day is not always so different.
"In the afternoon you can't walk up to work

**Tokenization**

In [None]:
from transformers import AutoTokenizer

# Load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Sample text
text = "This is a sample text for tokenization"

# Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print("Tokens : ", tokens)

# Convert the tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input Ids : ", input_ids)

# Encode the text in one call (tokenization + ID conversion);
# this also adds the special [CLS] (101) and [SEP] (102) tokens
encode = tokenizer(text)
print("Encode : ", encode)

# Decode the input IDs back to text
decode = tokenizer.decode(input_ids)
print("Decode : ", decode)


Tokens :  ['This', 'is', 'a', 'sample', 'text', 'for', 'token', '##ization']
Input Ids :  [1188, 1110, 170, 6876, 3087, 1111, 22559, 2734]
Encode :  {'input_ids': [101, 1188, 1110, 170, 6876, 3087, 1111, 22559, 2734, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode :  This is a sample text for tokenization
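The `token` / `##ization` split in the output above comes from WordPiece, the subword scheme BERT tokenizers use: each word is split greedily into the longest pieces found in the vocabulary, and pieces that continue a word are prefixed with `##`. Below is a simplified sketch of that idea, assuming a tiny hypothetical vocabulary (`toy_vocab`) instead of the real `bert-base-cased` vocabulary. The real tokenizer also handles punctuation and text normalization, which this sketch ignores.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, just large enough for the sample sentence.
toy_vocab = {"This", "is", "a", "sample", "text", "for", "token", "##ization"}

text = "This is a sample text for tokenization"
tokens = [piece for word in text.split() for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```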


**Finetuning Models**

In [None]:
!pip install datasets



# **Step 1 Load and prepare the dataset**

In [None]:
import datasets
dataset = datasets.load_dataset('imdb')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

# **Step 2 Process the data**

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset: pad every review to the model's maximum length
# and truncate reviews that are longer
def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
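The new `input_ids` and `attention_mask` columns all have the same length because of `padding='max_length'` and `truncation=True`. Below is a minimal sketch of what those two options do to a single sequence, assuming a pad token ID of 0 (the BERT value) and a small `max_length` for readability (BERT's actual limit is 512). The IDs are borrowed from the earlier tokenization output for illustration only. Unlike this sketch, the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_token_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation=True: cut off extra tokens
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_token_id] * padding      # padding='max_length': fill to the limit
    mask += [0] * padding                # 0 marks padding the model should ignore
    return ids, mask

# IDs taken from the earlier tokenization output (illustrative only).
encoded = [101, 1188, 1110, 170, 6876, 3087, 1111, 22559, 2734, 102]

padded_ids, padded_mask = pad_or_truncate(encoded, max_length=12)
short_ids, short_mask = pad_or_truncate(encoded, max_length=4)
print(padded_ids, padded_mask)
print(short_ids, short_mask)
```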

In [None]:
tokenized_datasets['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# **Step 3 Set up training arguments**

Specify the hyperparameters and training settings.


In [None]:
from transformers import TrainingArguments

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
)
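These settings determine how many optimizer steps one epoch takes. Below is a rough worked calculation, assuming a single device, no gradient accumulation, and the `Trainer` default of a linear learning-rate schedule without warmup (none of which is set explicitly in the arguments above). `linear_lr` is a hypothetical helper written for this sketch, not a library function.

```python
import math

train_examples = 25_000   # size of the IMDB train split
batch_size = 16           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
base_lr = 2e-5            # learning_rate

# Assumes one device and no gradient accumulation; the last partial batch is kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step, total, peak_lr):
    """Learning rate under linear decay to zero with no warmup."""
    return peak_lr * max(0.0, (total - step) / total)

print(steps_per_epoch)
print(linear_lr(0, total_steps, base_lr))
print(linear_lr(total_steps // 2, total_steps, base_lr))
```

With a batch size of 16, the 25,000 training reviews take 1,563 steps, the last of which is a partial batch.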

In [None]:
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False,

# **Step 4 Initialize Model**

In [None]:
from transformers import Trainer, AutoModelForSequenceClassification

# Load the pretrained model with a two-class head (negative / positive reviews)
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
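The `Trainer` above is created without a `compute_metrics` function, so evaluation reports only the loss. Below is a minimal sketch of an accuracy metric that could be passed through the `compute_metrics` argument, assuming the evaluation input unpacks into a `(logits, labels)` pair. It is written without NumPy so it runs on plain lists; the toy inputs are hypothetical.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # The predicted class is the index of the largest logit.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
        total += 1
    return {"accuracy": correct / total if total else 0.0}

# Hypothetical logits and labels for four reviews.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
toy_labels = [0, 1, 1, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Wiring it in would look like `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` should include an accuracy entry alongside the loss.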


# **Step 5 Train the model**

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
# Evaluate the model on the test split
result = trainer.evaluate()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
# Save the fine-tuned model and tokenizer
model.save_pretrained('./fine-tuned-models')
tokenizer.save_pretrained('./fine-tuned-models')

('./fine-tuned-models/tokenizer_config.json',
 './fine-tuned-models/special_tokens_map.json',
 './fine-tuned-models/vocab.txt',
 './fine-tuned-models/added_tokens.json',
 './fine-tuned-models/tokenizer.json')
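Because the model was loaded with only `num_labels=2`, its config carries the generic class names `LABEL_0` and `LABEL_1`, and a pipeline built from the saved directory would report those names. Below is a small sketch of mapping them back to readable names, relying on the IMDB convention that 0 is a negative review and 1 is a positive one. `rename_labels` and the `raw` prediction are hypothetical, for illustration only.

```python
# IMDB encodes negative reviews as 0 and positive reviews as 1.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def rename_labels(predictions, id2label):
    """Replace generic LABEL_<n> names with readable ones."""
    renamed = []
    for pred in predictions:
        index = int(pred["label"].rsplit("_", 1)[1])  # "LABEL_1" -> 1
        renamed.append({"label": id2label[index], "score": pred["score"]})
    return renamed

# Hypothetical pipeline-style prediction.
raw = [{"label": "LABEL_1", "score": 0.93}]
readable = rename_labels(raw, id2label)
print(readable)
```

An alternative is to pass `id2label` and `label2id` when loading the model, so the names are stored in the saved config.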