NLP with Hugging Face


Using Pre-trained Models

- Load a pre-trained text classification model using the AutoTokenizer and the AutoModelForSequenceClassification classes from transformers.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
# Define a model object from the pre-trained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name)


- Preparing input: Load a tokenizer for the model


In [2]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


- Run the model: Create a pipeline object from the chosen model, the tokenizer, and the task to perform, in our case sentiment analysis. If you initialize the pipeline with only the task, it falls back to a default model and tokenizer, which is not recommended in production.

In [3]:
from transformers import pipeline
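# Initialize a classifier with an explicit model and tokenizer
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# Alternative: pass only the task and let pipeline pick a default model.
# This reassignment replaces the classifier above, so the output below comes from the default model.
classifier = pipeline("sentiment-analysis")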

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



- Interpret the outputs: The pipeline returns a structure whose fields depend on the task. For sentiment analysis, it is a list with one dictionary per input, holding the predicted label and its score:

In [4]:
output = classifier("I've been waiting for this tutorial all my life!")
output

[{'label': 'POSITIVE', 'score': 0.9680967926979065}]

- The model predicts a positive sentiment with a score of about 0.97.
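To make the label and score in the output above more concrete, here is a minimal sketch of how a classifier's raw scores (logits) can be turned into that format with a softmax. The logits and the `id2label` mapping are hypothetical values chosen for illustration, and `softmax` and `to_prediction` are helper names introduced here, not part of transformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Return the highest-probability label in the same shape as the pipeline output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence; the label names mirror the output above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-1.6, 1.8]
prediction = to_prediction(example_logits, id2label)
print(prediction)
```

The score is therefore the model's probability for the winning class, not a measure of how strong the sentiment is.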

Model Fine-tuning


Fine-tuning is the process of taking a pre-trained model and updating its parameters by training on a dataset specific to your task. This lets you reuse the representations the model has already learned and adapt them to your use case.

In [7]:
# Initialize the model and load the dataset
from datasets import load_dataset
model = AutoModelForSequenceClassification.from_pretrained(model_name)
dataset = load_dataset("mteb/tweet_sentiment_extraction")


In [8]:
# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    # Pad every text to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize both splits in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
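The effect of `padding="max_length"` together with `truncation=True` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption. A real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill the remainder
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded, a long one is truncated (made-up token ids).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what allows examples to be stacked into batches during training.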


In [9]:
# Build the training and test subsets.
# The training set is used to fine-tune the model; the test set is used to evaluate it.
# The model and dataset loaded earlier are reused; 1,000 shuffled examples per split keep the run short.
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
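The shuffle-then-select pattern used above draws a reproducible random subset. A rough sketch of the idea in plain Python follows; `shuffled_subset` and the toy data are hypothetical, and this will not reproduce the exact permutation that the datasets library produces for the same seed.

```python
import random

def shuffled_subset(examples, n, seed):
    """Shuffle the example indices with a fixed seed and keep the first n examples."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded, so the same subset comes back every run
    return [examples[i] for i in indices[:n]]

toy_examples = [f"tweet {i}" for i in range(20)]  # hypothetical stand-in data
subset_a = shuffled_subset(toy_examples, n=5, seed=42)
subset_b = shuffled_subset(toy_examples, n=5, seed=42)
print(subset_a)
```

Fixing the seed matters because it keeps the training and evaluation subsets identical across runs, so results stay comparable.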

In [11]:
# Fine-tune the model: set up the training arguments and start training.
# The transformers library provides the Trainer class, which runs the training and evaluation loops.
# We first define the training arguments, including the evaluation strategy,
# and then train the model with trainer.train().
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

# Evaluate at the end of every epoch.
# Newer transformers releases rename this argument to eval_strategy; check your installed version.
training_args = TrainingArguments(output_dir="trainer_output", evaluation_strategy="epoch")
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = small_train_dataset,
    eval_dataset = small_test_dataset,
    compute_metrics = compute_metrics
)
trainer.train()




Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.921575,0.583
2,No log,0.862148,0.653
3,No log,1.136633,0.679


TrainOutput(global_step=375, training_loss=0.6114600423177083, metrics={'train_runtime': 14501.3406, 'train_samples_per_second': 0.207, 'train_steps_per_second': 0.026, 'total_flos': 397409283072000.0, 'train_loss': 0.6114600423177083, 'epoch': 3.0})
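The numbers in the TrainOutput above can be checked with a little arithmetic. The sketch below assumes the run used the TrainingArguments defaults for batch size (8 per device), number of epochs (3), and logging interval (500 steps) on a single device, since the cell above sets none of them explicitly.

```python
import math

train_examples = 1000   # size of small_train_dataset
batch_size = 8          # assumed: default per-device train batch size, one device
num_epochs = 3          # assumed: default number of training epochs
logging_steps = 500     # assumed: default logging interval, in optimizer steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# The training loss is reported only once logging_steps steps have been reached.
loss_was_logged = total_steps >= logging_steps
print(loss_was_logged)

train_runtime = 14501.3406  # seconds, copied from the TrainOutput above
samples_per_second = train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the step count matches `global_step=375`, and the run never reaches the first logging step, which would explain the "No log" entries in the Training Loss column.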

In [12]:
# Evaluate the model: after training, measure its performance on the evaluation set passed to the Trainer.
trainer.evaluate()


{'eval_loss': 1.136633038520813,
 'eval_accuracy': 0.679,
 'eval_runtime': 1002.7454,
 'eval_samples_per_second': 0.997,
 'eval_steps_per_second': 0.125,
 'epoch': 3.0}

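The model reaches an accuracy of about 68% (0.679) on the 1,000-example test subset. The validation loss rises in the third epoch (from 0.862 to 1.137) while accuracy still improves slightly, which may be an early sign of overfitting.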
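The accuracy reported here is the fraction of examples whose highest-scoring class matches the true label, which is what `compute_metrics` calculates with `np.argmax`. A small dependency-free sketch of the same calculation, using hypothetical logits and labels:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four examples and three classes.
toy_logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 1.5, 0.1],    # predicted class 1
    [-0.5, 0.4, 0.9],   # predicted class 2
    [1.2, 0.8, 0.1],    # predicted class 0
]
toy_labels = [0, 1, 2, 1]  # the last example is misclassified
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```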

In [13]:
# Share the model: log in to the Hugging Face Hub first
from huggingface_hub import notebook_login
notebook_login()


In [16]:
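# Push the model to the Hub. The string is used as the commit message;
# the repository name comes from output_dir ("trainer_output").
trainer.push_to_hub("my-basic-nlp-model")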



CommitInfo(commit_url='https://huggingface.co/mugambi645/trainer_output/commit/dbf7e0297ec6c043f6ee4a137bbb80cea4f1cfe7', commit_message='my-basic-nlp-model', commit_description='', oid='dbf7e0297ec6c043f6ee4a137bbb80cea4f1cfe7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mugambi645/trainer_output', endpoint='https://huggingface.co', repo_type='model', repo_id='mugambi645/trainer_output'), pr_revision=None, pr_num=None)

Using Hugging Face

With Hugging Face, most NLP tasks follow the same three steps, which usually take fewer than five lines of code:

1. Define a model object with the pipeline class (and the corresponding model and tokenizer).

2. Define the input text or prompt.

3. Execute the pre-trained model with our input and observe the output.

1. Text classification

Text classification is a fundamental NLP task: it assigns one or more categories to each input text. Applications include spam detection, sentiment analysis, and topic labeling.

In [18]:
# Import the pipeline function from transformers
from transformers import pipeline

# Load the pre-trained text classification model
classifier = pipeline("text-classification", model="lxyuan/distilbert-base-multilingual-cased-sentiments-student")
input_ = "I love Hugging Face so much"
output_ = classifier(input_)
print(output_)

[{'label': 'positive', 'score': 0.9525280594825745}]


2. Text generation

Text generation extends a prompt by repeatedly predicting the next token.

In [19]:
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
prompt = "AI is changing the world"
generated_text = generator(prompt, max_length=50)[0]["generated_text"]
print(generated_text)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


AI is changing the world, for the better.

In the coming months, we will all be in a fight.

If we don't act now, when does it end?

Follow us to learn more

(By
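The generated text stops abruptly because `max_length=50` caps the total number of tokens, and that total includes the prompt. The toy loop below sketches this behaviour with a hypothetical lookup table in place of a language model. It is not how GPT-2 works internally, and it always takes the single listed continuation instead of sampling.

```python
# Hypothetical next-token table standing in for a language model.
NEXT_TOKEN = {
    "AI": "is", "is": "changing", "changing": "the", "the": "world",
    "world": "for", "for": "the",
}

def generate(prompt_tokens, max_length, eos_token="<eos>"):
    """Append one token at a time until max_length tokens or an end token is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        # The toy "model" looks only at the last token.
        next_token = NEXT_TOKEN.get(tokens[-1], eos_token)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens

prompt = ["AI", "is", "changing", "the", "world"]
generated = generate(prompt, max_length=8)
print(" ".join(generated))
```

With a five-token prompt and `max_length=8`, only three new tokens are added, which is why a longer prompt leaves less room for generated text.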


3. Question answering

Question answering, commonly referred to as QA, is a field in NLP focused on building systems that automatically answer questions posed by humans in natural language.

In [21]:
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
context = "Many wars were fought in ancient world such as battle of Thermopylae and Red cliffs"
question = "Which are the biggest battles in ancient world?"
ans = qa_pipeline(question=question, context=context)
print(ans)

{'score': 0.9522756338119507, 'start': 47, 'end': 83, 'answer': 'battle of Thermopylae and Red cliffs'}
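The `start` and `end` values in this result are character offsets into the context, so the answer is a span copied out of the context, not newly generated text. The short check below reuses the context and the result dictionary from the output above; the bracket highlighting is only an illustration.

```python
context = "Many wars were fought in ancient world such as battle of Thermopylae and Red cliffs"
# Result dictionary copied from the output above.
ans = {"score": 0.9522756338119507, "start": 47, "end": 83,
       "answer": "battle of Thermopylae and Red cliffs"}

# Slicing the context with the offsets recovers the answer text.
span = context[ans["start"]:ans["end"]]
print(span)

# The offsets also make it easy to mark the answer inside its surrounding context.
highlighted = context[:ans["start"]] + "[" + span + "]" + context[ans["end"]:]
print(highlighted)
```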


4. Translation

Translation converts text from one language into another.


In [24]:
# Translate a text from English to German
from transformers import pipeline
translator = pipeline("translation_en_to_de")
txt_to_translate = "The universe is vast!"
translation = translator(txt_to_translate, max_length=40)
print(translation[0]["translation_text"])

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Das Universum ist riesig!
