### Logging into Hugging Face Hub from a Notebook

The following code demonstrates how to log into your Hugging Face account directly from a Jupyter notebook using the `notebook_login()` function from the `huggingface_hub` library.


In [68]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [3]:
!pip install transformers --upgrade



### Loading a Pretrained Model and Tokenizer from Hugging Face

This code snippet demonstrates how to load a pretrained model and tokenizer from Hugging Face using the `transformers` library. In this case, we're using the `distilbert-base-uncased` model for sequence classification tasks.


In [30]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To speed up training, we use Hugging Face's `accelerate` library.

In [31]:
from accelerate import Accelerator

# Initialize the accelerator
accelerator = Accelerator()

In [6]:
! pip install datasets



### Loading a Dataset from Hugging Face
This step loads the **AG News** dataset, a standard text classification benchmark. It contains news articles, each labeled with one of four classes.

In [32]:
from datasets import load_dataset

# Load the AG News dataset
dataset = load_dataset("ag_news")

In [8]:
dataset.shape

{'train': (120000, 2), 'test': (7600, 2)}

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [10]:
dataset['train'][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

In [11]:
dataset['test'][12]

{'text': "Dutch Retailer Beats Apple to Local Download Market  AMSTERDAM (Reuters) - Free Record Shop, a Dutch music  retail chain, beat Apple Computer Inc. to market on Tuesday  with the launch of a new download service in Europe's latest  battleground for digital song services.",
 'label': 3}

In [12]:
dataset['train'][1:3]

{'text': ['Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
  "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums."],
 'label': [2, 2]}
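
Slicing a split, as in `dataset['train'][1:3]` above, returns a dictionary of columns rather than a list of rows. The sketch below imitates that behavior with plain Python lists. The names `rows` and `slice_as_columns` are hypothetical and exist only for illustration; they are not part of the `datasets` API.

```python
# Toy stand-in for a dataset split: a list of row dictionaries.
rows = [
    {"text": "first article", "label": 2},
    {"text": "second article", "label": 2},
    {"text": "third article", "label": 3},
]

def slice_as_columns(rows, start, stop):
    """Return rows[start:stop] as a dict mapping column name -> list of values."""
    columns = {name: [] for name in rows[0]}
    for row in rows[start:stop]:
        for name, value in row.items():
            columns[name].append(value)
    return columns

batch = slice_as_columns(rows, 1, 3)
print(batch)
```

This columnar layout is why a batched `map` function receives `examples["text"]` as a list of strings.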

### Tokenizing the Dataset
The function below tokenizes the text column with the pretrained tokenizer loaded earlier. Tokenization is an essential preprocessing step: it converts raw text into the `input_ids` and `attention_mask` values the model expects.

In [33]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [34]:
# Map the tokenization function to the dataset
encoded_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)

In [15]:
train_dataset = encoded_dataset["train"]
eval_dataset = encoded_dataset["test"]

In [16]:
train_dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
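
The `input_ids` and `attention_mask` columns come from the tokenizer call with `padding="max_length"` and `truncation=True`. As a rough illustration of what those two options do to a single sequence, here is a simplified pure-Python sketch. The function name `pad_or_truncate`, the token ids, and the small `max_length` are assumptions made for the example; a real tokenizer also handles special tokens during truncation, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    ids = token_ids[:max_length]          # truncation: drop tokens past the limit
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,   # 0 marks padding
    }

short = pad_or_truncate([101, 2813, 102], max_length=6)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is simple but spends compute on padding tokens when most texts are short.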

### Using a Data Collator for Padding
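The data collator handles padding dynamically during training and evaluation: it pads each batch to the length of its longest sequence rather than to a fixed length. Note that `tokenize_function` above already pads every example with `padding="max_length"`, so in this notebook the collator has nothing left to pad; dropping that argument from the tokenizer call would let dynamic padding take effect.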
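
To make the idea of per-batch padding concrete, here is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation; `pad_batch` and the token ids are hypothetical and only illustrate the principle.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A batch of short texts is padded only to its own longest member, so the padded width varies from batch to batch.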

### Defining Training Arguments for Model Training

In [53]:
mymodel_name = 'distilbert-base-uncased-finetuned/ag_news_AK'

In [54]:
from transformers import TrainingArguments

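# Define training arguments
training_args = TrainingArguments(
    mymodel_name,                    # output directory
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="steps",           # named `evaluation_strategy` in older transformers releases
    load_best_model_at_end=True,
    push_to_hub=True,
    gradient_accumulation_steps=4,   # effective batch size: 32 * 4 = 128 per device
    save_steps=1000,                 # evaluation runs every 500 steps by default, checkpoints only every 1000
    fp16=True,
)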
training_args




TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=500,
eval_strategy=IntervalStrategy.STEPS,
eval_u

In [18]:
!pip install evaluate



### Computing Accuracy Metric for Model Evaluation
The `compute_metric` function below evaluates the model by accuracy. It takes the predicted logits and the reference labels, picks the highest-scoring class for each example, and uses the `evaluate` library to compute the accuracy.

In [55]:
import evaluate
import numpy as np

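# Load the metric once instead of on every evaluation call
metric = evaluate.load('accuracy')

def compute_metric(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # pick the highest-scoring class
    return metric.compute(predictions=predictions, references=labels)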
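
As a sanity check on what the accuracy metric measures, the same computation can be sketched without any libraries. The logits and labels below are made-up values, and `accuracy_from_logits` is a hypothetical helper, not the `evaluate` implementation.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

logits = [
    [0.1, 0.2, 2.5, 0.3],   # highest score at index 2
    [1.9, 0.4, 0.2, 0.1],   # highest score at index 0
    [0.0, 0.3, 0.1, 3.2],   # highest score at index 3
    [0.2, 1.1, 0.9, 0.4],   # highest score at index 1
]
labels = [2, 0, 3, 2]
result = accuracy_from_logits(logits, labels)
print(result)
```

Three of the four made-up predictions match their labels, so the sketch reports an accuracy of 0.75.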

In [56]:
# Load the model for sequence classification with four output labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
from transformers import DataCollatorWithPadding

# Define data collator with padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [58]:
# Prepare the dataset for pytorch
encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


### DataLoader for Efficient Training and Evaluation

In [59]:
from torch.utils.data import DataLoader

# Create DataLoaders for efficient data loading
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=data_collator,
    num_workers=4,
    pin_memory=True,
)

eval_loader = DataLoader(
    eval_dataset,
    batch_size=32,
    shuffle=False,
    collate_fn=data_collator,
    num_workers=4,
    pin_memory=True,
)



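### Learning Rate Scheduling with Warmup
We use the `get_linear_schedule_with_warmup` scheduler from the Hugging Face `transformers` library. It raises the learning rate linearly from zero over the warm-up phase, then decreases it linearly to zero over the remaining training steps. In the cell below `num_warmup_steps=0`, so there is no warm-up and the learning rate starts decaying from `5e-5` at the first step. Because this optimizer is passed to the `Trainer` explicitly, its `lr=5e-5` takes effect instead of the `learning_rate=2e-5` set in `TrainingArguments`. Note also that `len(train_loader)` counts batches rather than optimizer steps; with `gradient_accumulation_steps=4` the `Trainer` performs roughly a quarter as many optimizer steps, so the schedule completes only part of its decay.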


In [60]:
from transformers import get_linear_schedule_with_warmup

# Define learning rate scheduler
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_train_steps = len(train_loader) * training_args.num_train_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_train_steps,
)
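
The shape of a linear warm-up and decay schedule can be sketched in a few lines of plain Python. This is an illustration of the multiplier applied to the base learning rate, written from the description above rather than copied from the library; the function name and the warm-up length of 100 steps are assumptions for the example (the notebook itself uses 0 warm-up steps).

```python
def linear_warmup_decay_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warm-up: ramp linearly from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

With `num_warmup_steps=0` the first branch is never taken, so the multiplier starts at 1 and only decays.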

### Using the `Trainer` Class for Model Training and Evaluation

In [62]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    optimizers=(optimizer, scheduler),
    compute_metrics=compute_metric,
)



In [63]:
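# `accelerator` was created earlier with Accelerator().
# Note: Trainer already uses Accelerate internally, so this call is optional;
# prepare() is designed for models, optimizers, dataloaders, and schedulers.
trainer = accelerator.prepare(trainer)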

In [64]:
# Fine-tune the model
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500 | 0.2594 | 0.184652 | 0.936974 |


Could not locate the best model at distilbert-base-uncased-finetuned/ag_news_AK/checkpoint-500/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.


TrainOutput(global_step=937, training_loss=0.22389550142888706, metrics={'train_runtime': 1421.0272, 'train_samples_per_second': 84.446, 'train_steps_per_second': 0.659, 'total_flos': 1.5888176591142912e+16, 'train_loss': 0.22389550142888706, 'epoch': 0.9994666666666666})
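
The `global_step=937` and `epoch` values in this output can be reproduced from the batch settings. The arithmetic below is a sketch that assumes a single device (the printed arguments show `_n_gpu=1`) and that the number of optimizer steps is the number of batches divided by the accumulation steps, rounded down; the variable names are chosen for the example.

```python
num_train_rows = 120_000        # from dataset.shape
per_device_batch_size = 32      # per_device_train_batch_size
grad_accum_steps = 4            # gradient_accumulation_steps
num_devices = 1                 # assumption: single GPU

# Batches per epoch, rounded up so a partial last batch still counts
batches_per_epoch = -(-num_train_rows // (per_device_batch_size * num_devices))

# One optimizer step is taken after every `grad_accum_steps` batches
optimizer_steps = batches_per_epoch // grad_accum_steps
effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps

# Fraction of the epoch covered by those optimizer steps
epoch_fraction = optimizer_steps * grad_accum_steps / batches_per_epoch

print(batches_per_epoch, optimizer_steps, effective_batch_size, epoch_fraction)
```

Since only 937 optimizer steps are taken, step-based evaluation with `eval_steps=500` fires once, which matches the single row in the training log.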

In [65]:
# Evaluate the model
print("Evaluating model...")
results = trainer.evaluate()
print(results)

Evaluating model...


{'eval_loss': 0.17110693454742432, 'eval_accuracy': 0.9418421052631579, 'eval_runtime': 32.3253, 'eval_samples_per_second': 235.11, 'eval_steps_per_second': 7.363, 'epoch': 0.9994666666666666}


In [67]:
# Save the model and tokenizer locally first
model.save_pretrained('./finetuned_Ag_news_AK')
tokenizer.save_pretrained('./finetuned_Ag_news_AK')

('./finetuned_Ag_news_AK/tokenizer_config.json',
 './finetuned_Ag_news_AK/special_tokens_map.json',
 './finetuned_Ag_news_AK/vocab.txt',
 './finetuned_Ag_news_AK/added_tokens.json',
 './finetuned_Ag_news_AK/tokenizer.json')

In [69]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/Adiii143/ag_news_AK/commit/85281a7aef51de0d0245a5d5d0cf43311879b2e6', commit_message='End of training', commit_description='', oid='85281a7aef51de0d0245a5d5d0cf43311879b2e6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Adiii143/ag_news_AK', endpoint='https://huggingface.co', repo_type='model', repo_id='Adiii143/ag_news_AK'), pr_revision=None, pr_num=None)

In [71]:
model.push_to_hub("distilbert-base-uncased-finetuned-ag_news")


CommitInfo(commit_url='https://huggingface.co/Adiii143/distilbert-base-uncased-finetuned-ag_news/commit/66524813b296c64513e8ff34cc59e5dc4db18281', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='66524813b296c64513e8ff34cc59e5dc4db18281', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Adiii143/distilbert-base-uncased-finetuned-ag_news', endpoint='https://huggingface.co', repo_type='model', repo_id='Adiii143/distilbert-base-uncased-finetuned-ag_news'), pr_revision=None, pr_num=None)

In [72]:
tokenizer.push_to_hub("distilbert-base-uncased-finetuned-ag_news_AK")

CommitInfo(commit_url='https://huggingface.co/Adiii143/distilbert-base-uncased-finetuned-ag_news_AK/commit/2f14046d61fbfa071d8d87646f7ba09775cff648', commit_message='Upload tokenizer', commit_description='', oid='2f14046d61fbfa071d8d87646f7ba09775cff648', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Adiii143/distilbert-base-uncased-finetuned-ag_news_AK', endpoint='https://huggingface.co', repo_type='model', repo_id='Adiii143/distilbert-base-uncased-finetuned-ag_news_AK'), pr_revision=None, pr_num=None)

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('your-username/my_model')
tokenizer = AutoTokenizer.from_pretrained('your-username/my_model')
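
Once a model is loaded, its output logits still need to be turned into a predicted class. The sketch below shows that last step in plain Python, using the label names from `train_dataset.features`. The logits are made-up numbers rather than the result of a real forward pass, and `softmax` and `predict_label` are hypothetical helpers.

```python
import math

# Class names in label-id order, as listed in the dataset features
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABEL_NAMES[best], probs[best]

example_logits = [-1.2, -0.7, 3.1, 0.4]     # illustrative values only
label, confidence = predict_label(example_logits)
print(label, round(confidence, 3))
```

With a real model, the logits would come from running the tokenized text through the loaded model instead of a hard-coded list.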
