# NLP with Hugging Face

## Processing the data for NLP

### Downloading the dataset

In [1]:
%%capture
!pip install datasets transformers evaluate

We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset. To load it, we select the `mrpc` subset of the `glue` dataset:

In [2]:
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")

This is what an example looks like. Each `mrpc` example consists of two sentences and a label indicating whether the two sentences are equivalent.

In [3]:
ex = ds["train"][400]
ex

{'sentence1': 'U.S. Agriculture Secretary Ann Veneman , who announced Tuesdays ban , also said Washington would send a technical team to Canada to help .',
 'sentence2': "U.S. Agriculture Secretary Ann Veneman , who announced yesterday 's ban , also said Washington would send a technical team to Canada to assist in the Canadian situation .",
 'label': 1,
 'idx': 446}

Let's see what the labels in our data are.

In [4]:
labels = ds["train"].features["label"]

In [5]:
labels.int2str(1)

'equivalent'

### Tokenizing

We download the tokenizer directly from the repo of the model we will use.

In [10]:

from transformers import AutoTokenizer

repo_id = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

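We create a tokenization function. It receives an example (or a batch of examples, since we call `map` with `batched=True`) and tokenizes the two sentences as a pair, truncating them if they are too long for the model.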

In [11]:
def tokenize_fn(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [12]:
prepared_ds = ds.map(tokenize_fn, batched=True)

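To make `truncation=True` on a sentence pair more concrete, here is a toy sketch of the "longest first" strategy: tokens are dropped one at a time from whichever sequence is currently longer until the pair fits. The helper names are hypothetical, whitespace splitting stands in for the real subword tokenizer, and the four special tokens follow the RoBERTa-style pair layout `<s> A </s></s> B </s>`; treat it as an illustration rather than the library's implementation.

```python
def truncate_pair_longest_first(tokens_a, tokens_b, max_length, num_special_tokens=4):
    """Drop tokens from the end of the longer sequence until the pair fits."""
    if max_length < num_special_tokens:
        raise ValueError("max_length must leave room for the special tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_length - num_special_tokens  # room left for real tokens
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b


def build_pair(tokens_a, tokens_b, max_length):
    """Assemble the pair in the assumed RoBERTa-style layout."""
    a, b = truncate_pair_longest_first(tokens_a, tokens_b, max_length)
    return ["<s>", *a, "</s>", "</s>", *b, "</s>"]


# Whitespace splitting is only a stand-in for the real subword tokenizer.
first = "the cat sat on the mat today".split()
second = "a cat sat".split()
encoded = build_pair(first, second, max_length=12)
print(encoded)
```

Only the longer first sentence loses tokens here; the shorter second sentence is kept whole.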

### Defining the data collator: dynamic padding

We pad every example to the length of the longest element in its batch, rather than to a fixed global length.

In [13]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
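As a rough illustration of what dynamic padding does to a batch, the sketch below pads a list of tokenized examples to the longest one and builds the matching attention mask. `pad_batch` is a hypothetical helper, the token ids are arbitrary, and the padding id of 1 is an assumption based on RoBERTa-style vocabularies; the real collator also handles labels and returns framework tensors.

```python
import numpy as np

PAD_TOKEN_ID = 1  # assumption: RoBERTa-style vocabularies use id 1 for <pad>


def pad_batch(features, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every example to the longest sequence in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids = np.full((len(features), max_len), pad_token_id, dtype=np.int64)
    attention_mask = np.zeros((len(features), max_len), dtype=np.int64)
    for row, f in enumerate(features):
        n = len(f["input_ids"])
        input_ids[row, :n] = f["input_ids"]
        attention_mask[row, :n] = 1  # 1 marks real tokens, 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary token ids, chosen only to have two different lengths.
batch = pad_batch([
    {"input_ids": [0, 713, 2]},
    {"input_ids": [0, 713, 16, 10, 2]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is recomputed per batch, a batch of short sentences carries far less padding than it would with a fixed maximum length.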

## Training and evaluation

We define the metric, the training arguments, and the trainer.

### Defining the metric

In [14]:
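import evaluate
import numpy as np

# Load the metric once instead of on every evaluation call
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_pred):
  # eval_pred unpacks into the model logits and the reference labels
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)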
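The GLUE MRPC metric reports accuracy and F1. To show what those two numbers mean, here is a small NumPy sketch that computes them by hand from logits, assuming class 1 is the positive class. `accuracy_and_f1` and the toy arrays are hypothetical and only illustrate the arithmetic; in the notebook the `evaluate` metric does this work.

```python
import numpy as np


def accuracy_and_f1(logits, labels):
    """Accuracy and binary F1 (positive class = 1) from raw logits."""
    predictions = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float(np.mean(predictions == labels))
    tp = int(np.sum((predictions == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((predictions == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((predictions == 0) & (labels == 1)))  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy values, not taken from the model.
toy_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]])
toy_labels = [1, 0, 0, 1]
toy_scores = accuracy_and_f1(toy_logits, toy_labels)
print(toy_scores)
```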

### Configuring the `Trainer`


In [15]:
from transformers import AutoModelForSequenceClassification

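# Class names from the dataset ("labels" now holds a plain list of strings)
labels = ds["train"].features["label"].names

model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    num_labels=len(labels),
    # Integer ids on both sides keep the two mappings consistent
    id2label={i: c for i, c in enumerate(labels)},
    label2id={c: i for i, c in enumerate(labels)},
)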

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
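The `id2label` mapping is what turns a class index back into a readable name at prediction time. The sketch below shows that step in isolation: a softmax over two logits followed by a lookup. `predict_label` is a hypothetical helper, the logits are made up, and the name `not_equivalent` for class 0 is an assumption (only `equivalent` for class 1 appears above).

```python
import numpy as np

# Assumption: class 0 is named "not_equivalent"; class 1 is "equivalent".
id2label = {0: "not_equivalent", 1: "equivalent"}


def predict_label(logits):
    """Return the most likely label name and its softmax probability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    class_id = int(np.argmax(probs))
    return id2label[class_id], float(probs[class_id])


# Made-up logits for a single sentence pair.
label, prob = predict_label([-1.0, 2.0])
print(label, round(prob, 4))
```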


In [16]:
%%capture
!pip install transformers[torch]


In [17]:
%%capture
!pip install accelerate -U

### Defining the training arguments

In [24]:
from transformers import TrainingArguments

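training_args = TrainingArguments(
    output_dir="./distilroberta-base-mrpc-glue-francisco-flores",
    evaluation_strategy="steps",  # renamed to `eval_strategy` in recent transformers releases
    num_train_epochs=10,
    push_to_hub=True,  # requires being logged in to the Hub
    load_best_model_at_end=True,  # reload the best checkpoint once training finishes
)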
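With these arguments it is possible to estimate how long training runs and when evaluation happens. The arithmetic below is a sketch under several assumptions that are not set explicitly in the cell: 3,668 examples in the MRPC train split, the default batch size of 8 on a single device, no gradient accumulation, and the default interval of 500 steps between evaluations.

```python
import math

num_train_examples = 3668  # assumption: size of the MRPC train split
batch_size = 8             # assumption: default per-device batch size, one device
num_epochs = 10            # from the TrainingArguments cell
eval_every = 500           # assumption: default step interval for evaluation

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps)
print(eval_steps)
```

Under these assumptions the run takes 4,590 optimizer steps and is evaluated nine times, at steps 500 through 4,500.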

In [19]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Preparing the trainer

In [25]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=prepared_ds["train"],
    eval_dataset=prepared_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

### Training

In [26]:
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)

| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 500 | 0.338 | 0.778242 | 0.838235 | 0.889632 |
| 1000 | 0.3216 | 0.718382 | 0.845588 | 0.886894 |
| 1500 | 0.1861 | 1.10946 | 0.835784 | 0.887395 |
| 2000 | 0.1101 | 1.3526 | 0.82598 | 0.879865 |
| 2500 | 0.0572 | 1.24636 | 0.82598 | 0.875657 |
| 3000 | 0.0443 | 1.219425 | 0.840686 | 0.886562 |
| 3500 | 0.0321 | 1.351915 | 0.833333 | 0.880282 |
| 4000 | 0.0146 | 1.499911 | 0.830882 | 0.884034 |
| 4500 | 0.0082 | 1.490843 | 0.833333 | 0.883959 |


***** train metrics *****
  epoch                    =       10.0
  total_flos               =   638575GF
  train_loss               =     0.1214
  train_runtime            = 0:08:00.69
  train_samples_per_second =     76.306
  train_steps_per_second   =      9.549


### Evaluation

In [27]:
metrics = trainer.evaluate(prepared_ds["validation"])
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** eval metrics *****
  epoch                   =       10.0
  eval_accuracy           =     0.8456
  eval_f1                 =     0.8869
  eval_loss               =     0.7184
  eval_runtime            = 0:00:01.88
  eval_samples_per_second =    216.932
  eval_steps_per_second   =     27.116
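The evaluation numbers above (loss 0.7184, accuracy 0.8456, F1 0.8869) coincide with the step-1000 row of the training log, not with the last logged step. That is what one would expect from `load_best_model_at_end=True`, which by default ranks checkpoints by evaluation loss. A minimal sketch of that selection, with the logged rows copied by hand into a hypothetical `log` list:

```python
# (step, validation_loss, accuracy, f1), copied from the training log above
log = [
    (500, 0.778242, 0.838235, 0.889632),
    (1000, 0.718382, 0.845588, 0.886894),
    (1500, 1.10946, 0.835784, 0.887395),
    (2000, 1.3526, 0.82598, 0.879865),
    (2500, 1.24636, 0.82598, 0.875657),
    (3000, 1.219425, 0.840686, 0.886562),
    (3500, 1.351915, 0.833333, 0.880282),
    (4000, 1.499911, 0.830882, 0.884034),
    (4500, 1.490843, 0.833333, 0.883959),
]

# Lower validation loss is better, so take the minimum over that column.
best_step, best_loss, best_acc, best_f1 = min(log, key=lambda row: row[1])
print(best_step, best_loss, best_acc, best_f1)
```

The log also shows the training loss falling steadily while the validation loss rises after step 1000, a pattern that usually indicates overfitting, so fewer epochs would probably have been enough here.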


### Sharing on the Hub

In [32]:
kwargs = {
    "finetuned_from": model.config._name_or_path,
    "tasks": "text-classification",
    "dataset": "glue",
    "tags": ["text-classification"]
}

trainer.push_to_hub(commit_message="final NLP model done! 🤗", **kwargs)

'https://huggingface.co/franciscoafy/distilroberta-base-mrpc-glue-francisco-flores/tree/main/'