<a href="https://colab.research.google.com/github/Howl06/practice/blob/main/HuggingFace/distilbert_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets evaluate



#### Dataset

In [2]:
from datasets import load_dataset
# load dataset
imdb = load_dataset("imdb")




In [3]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [4]:
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [5]:
# Models: https://huggingface.co/models

MODEL_NAME = "distilbert-base-uncased"

In [6]:
from transformers import AutoTokenizer

# load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [7]:
# vocab: word to id
tokenizer.get_vocab()

{'##lce': 23314,
 '##leton': 19263,
 '##ত': 29898,
 'weekend': 5353,
 'symmetry': 14991,
 '##ivation': 25761,
 '##dication': 25027,
 'cipher': 27715,
 'battery': 6046,
 'pasta': 24857,
 'fated': 27442,
 'mood': 6888,
 'chatham': 16727,
 'nature': 3267,
 '##aby': 21275,
 'ranch': 8086,
 'what': 2054,
 '##都': 30487,
 'technique': 6028,
 'songs': 2774,
 'ev': 23408,
 'daring': 15236,
 'streams': 9199,
 'brand': 4435,
 'essays': 8927,
 'van': 3158,
 'registers': 18687,
 'pressed': 4508,
 'philharmonic': 12355,
 'explain': 4863,
 '₤': 1572,
 'barron': 23594,
 'conquer': 16152,
 '同': 1794,
 'blanc': 18698,
 'smeared': 25400,
 '##duced': 28901,
 'permitting': 24523,
 'squeeze': 11025,
 '」': 1642,
 '##own': 12384,
 'ro': 20996,
 'brethren': 17937,
 '##weiler': 22384,
 'bragg': 23678,
 'tiny': 4714,
 'crying': 6933,
 'devotees': 22707,
 'quadrant': 29371,
 '³': 1083,
 '城': 1804,
 'girlfriend': 6513,
 'skinner': 17451,
 'committed': 5462,
 'bundesliga': 14250,
 'treasure': 8813,
 '##main': 24238

In [8]:
# tokenize every split of the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], 
                     truncation=True, # cut off reviews that are too long
                     max_length=50) # maximum number of tokens per review

tokenized_imdb = imdb.map(preprocess_function, batched=True)






In [9]:
tokenized_imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
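
The effect of `truncation=True, max_length=50` on `input_ids` can be sketched without the real tokenizer. The snippet below is a toy illustration only: `truncate_with_special_tokens` is a hypothetical helper, the content ids are made up, and it assumes that `max_length` counts the `[CLS]` (101) and `[SEP]` (102) tokens, which leaves room for 48 word-piece ids per review.

```python
# Toy illustration of truncation=True, max_length=50 (not the real tokenizer).
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the distilbert-base-uncased vocab


def truncate_with_special_tokens(content_ids, max_length=50):
    """Keep at most max_length ids in total, counting [CLS] and [SEP]."""
    budget = max_length - 2  # two slots are reserved for the special tokens
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


long_review = list(range(2000, 2300))  # 300 made-up content token ids
short_review = [1045, 2054, 5353]      # 3 content token ids

truncated = truncate_with_special_tokens(long_review)
untouched = truncate_with_special_tokens(short_review)

print(len(truncated))  # 50
print(untouched)       # [101, 1045, 2054, 5353, 102]
```

Short reviews are left at their natural length here; padding is a separate step handled later by the data collator.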

In [10]:
id2text = {v: k for k, v in tokenizer.get_vocab().items()}
id2text

{23314: '##lce',
 19263: '##leton',
 29898: '##ত',
 5353: 'weekend',
 14991: 'symmetry',
 25761: '##ivation',
 25027: '##dication',
 27715: 'cipher',
 6046: 'battery',
 24857: 'pasta',
 27442: 'fated',
 6888: 'mood',
 16727: 'chatham',
 3267: 'nature',
 21275: '##aby',
 8086: 'ranch',
 2054: 'what',
 30487: '##都',
 6028: 'technique',
 2774: 'songs',
 23408: 'ev',
 15236: 'daring',
 9199: 'streams',
 4435: 'brand',
 8927: 'essays',
 3158: 'van',
 18687: 'registers',
 4508: 'pressed',
 12355: 'philharmonic',
 4863: 'explain',
 1572: '₤',
 23594: 'barron',
 16152: 'conquer',
 1794: '同',
 18698: 'blanc',
 25400: 'smeared',
 28901: '##duced',
 24523: 'permitting',
 11025: 'squeeze',
 1642: '」',
 12384: '##own',
 20996: 'ro',
 17937: 'brethren',
 22384: '##weiler',
 23678: 'bragg',
 4714: 'tiny',
 6933: 'crying',
 22707: 'devotees',
 29371: 'quadrant',
 1083: '³',
 1804: '城',
 6513: 'girlfriend',
 17451: 'skinner',
 5462: 'committed',
 14250: 'bundesliga',
 8813: 'treasure',
 24238: '##main'

In [11]:
id2text[101]

'[CLS]'

In [12]:
tokenizer.get_vocab()["i"]

1045
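
The two lookups above (id to token, token to id) are enough to turn a list of ids back into readable tokens. A minimal sketch, assuming a toy vocabulary that reuses a few entries from the outputs above; `ids_to_tokens` and `join_wordpieces` are hypothetical helpers, and the `ro` + `##main` split is chosen only to show how a `##` continuation piece attaches to the previous token, not because the real tokenizer necessarily produces it.

```python
# Toy vocabulary built from entries shown in the notebook outputs.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "i": 1045,
    "what": 2054, "ro": 20996, "##main": 24238,
}
id2text = {v: k for k, v in toy_vocab.items()}  # invert: id -> token


def ids_to_tokens(ids):
    """Map each id back to its vocabulary entry."""
    return [id2text[i] for i in ids]


def join_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


tokens = ids_to_tokens([101, 1045, 20996, 24238, 102])
print(tokens)                   # ['[CLS]', 'i', 'ro', '##main', '[SEP]']
print(join_wordpieces(tokens))  # ['[CLS]', 'i', 'romain', '[SEP]']
```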

In [13]:
# inspect the first tokenized training example
data = tokenized_imdb["train"][0]

for key in ['text', 'label', 'input_ids', 'attention_mask']:
    print(f'{key}: ', data[key])

text:  I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwe

In [14]:
from transformers import DataCollatorWithPadding
# pad each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
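
What dynamic padding does to a batch can be sketched in plain Python. This is a simplified stand-in, not the collator's implementation: `pad_batch` is a hypothetical helper that returns lists rather than tensors, and it assumes right-side padding with the `[PAD]` id 0.

```python
PAD_ID = 0  # [PAD] id in the distilbert-base-uncased vocab


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1045, 102], [101, 1045, 2054, 5353, 102]])
print(batch["input_ids"][0])       # [101, 1045, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to one global length keeps batches of short reviews small.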

#### Metrics

In [15]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

In [16]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, 
                            references=labels)
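
The same computation spelled out without `numpy` or `evaluate`, as a rough sketch: take the index of the largest logit in each row as the predicted class, then count how often it matches the label. The logits and labels below are made up, and `accuracy_from_logits` is a hypothetical name.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# made-up logits for four reviews, columns are [NEGATIVE, POSITIVE]
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```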

In [17]:
# id & label mapping
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [18]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
                              MODEL_NAME, # model name
                              num_labels=2, # number of classes
                              id2label=id2label, 
                              label2id=label2id
                              )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

In [19]:
BS = 32

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=BS,
    per_device_eval_batch_size=BS,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False, # set to True to upload the model to the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.4891        | 0.413966        | 0.79604  |
| 2     | 0.3525        | 0.424327        | 0.80264  |


TrainOutput(global_step=1564, training_loss=0.4116560089618654, metrics={'train_runtime': 303.0292, 'train_samples_per_second': 165.001, 'train_steps_per_second': 5.161, 'total_flos': 646813470000000.0, 'train_loss': 0.4116560089618654, 'epoch': 2.0})
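
The numbers in `TrainOutput` can be checked by hand. The arithmetic below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept. Under those assumptions each epoch takes 782 optimizer steps, so `save_strategy="epoch"` would produce checkpoints at steps 782 and 1564, which matches the `checkpoint-1564` directory loaded in the inference cell.

```python
import math

train_rows = 25000        # num_rows of imdb["train"]
batch_size = 32           # BS
epochs = 2                # num_train_epochs
train_runtime = 303.0292  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)               # 782
print(total_steps)                   # 1564
print(round(samples_per_second, 3))  # 165.001
print(round(steps_per_second, 3))    # 5.161
```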

#### Inference

pipeline: https://huggingface.co/docs/transformers/v4.28.1/en/quicktour#pipeline

In [20]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [21]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", # task
            # model="stevhliu/my_awesome_model", # from huggingface hub
            model="./my_awesome_model/checkpoint-1564", # local
            )

classifier(text)

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9847169518470764}]
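
The `score` is a probability derived from the model's two output logits. A minimal sketch of that last step, assuming the pipeline applies a softmax for this two-label model and reports the most probable class; the logits are hypothetical and `to_pipeline_output` is an illustrative helper, not part of the library.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def to_pipeline_output(logits):
    """Mimic the shape of the pipeline result for a single text."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


result = to_pipeline_output([-2.0, 2.0])  # hypothetical logits
print(result[0]["label"])  # POSITIVE
print(result[0]["score"])
```

The further apart the two logits are, the closer the reported score gets to 1.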

#### Resources

1. Custom dataset: https://huggingface.co/transformers/v3.2.0/custom_datasets.html#seq-imdb
2. Notebooks: https://huggingface.co/docs/transformers/notebooks
3. Pipeline: https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines