<a href="https://colab.research.google.com/github/ThousandAI/Application-of-AI/blob/main/class09/transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Hugging Face Transformers](https://huggingface.co/docs/transformers/index)**

In [1]:
!pip install transformers
from transformers import pipeline



## **Pipeline**

In [2]:
pipe = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
pipe("Even if the AI course is difficult, we still enjoy the course content brought to us by the teacher.")

[{'label': 'POSITIVE', 'score': 0.9997095465660095}]

In [4]:
pipe("Code is more difficult to understand than other courses.")

[{'label': 'NEGATIVE', 'score': 0.9992061257362366}]

In [5]:
pipe(["Even if the AI course is difficult, we still enjoy the course content brought to us by the teacher",
      "Code is more difficult to understand than other courses."])

[{'label': 'POSITIVE', 'score': 0.9996920824050903},
 {'label': 'NEGATIVE', 'score': 0.9992061257362366}]

## **English Model**

In [6]:
english_model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [7]:
from transformers import AutoTokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=english_model_name)

In [9]:
tokenizer.tokenize("Even if the AI course is difficult, we still enjoy the course content brought to us by the teacher.")

['even',
 'if',
 'the',
 'ai',
 'course',
 'is',
 'difficult',
 ',',
 'we',
 'still',
 'enjoy',
 'the',
 'course',
 'content',
 'brought',
 'to',
 'us',
 'by',
 'the',
 'teacher',
 '.']

In [10]:
tokenizer("Even if the AI course is difficult, we still enjoy the course content brought to us by the teacher.")

{'input_ids': [101, 2130, 2065, 1996, 9932, 2607, 2003, 3697, 1010, 2057, 2145, 5959, 1996, 2607, 4180, 2716, 2000, 2149, 2011, 1996, 3836, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
X_train = ["Even if the AI course is difficult, we still enjoy the course content brought to us by the teacher",
           "Code is more difficult to understand than other courses."]

batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

{'input_ids': tensor([[ 101, 2130, 2065, 1996, 9932, 2607, 2003, 3697, 1010, 2057, 2145, 5959,
         1996, 2607, 4180, 2716, 2000, 2149, 2011, 1996, 3836,  102],
        [ 101, 3642, 2003, 2062, 3697, 2000, 3305, 2084, 2060, 5352, 1012,  102,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
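
The batch above shows what `padding=True` does: the shorter sentence is filled with `0` up to the length of the longest one, and `attention_mask` marks the real tokens with `1` and the padding with `0`. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The helper `pad_batch` is a hypothetical name introduced here, and the token ids are copied from the output above (`101` and `102` are `[CLS]` and `[SEP]`, `0` is `[PAD]` in this vocabulary).

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask.

    Hypothetical helper for illustration; it only mimics the effect of padding=True.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids of the two sentences, taken from the tokenizer output above
first = [101, 2130, 2065, 1996, 9932, 2607, 2003, 3697, 1010, 2057, 2145,
         5959, 1996, 2607, 4180, 2716, 2000, 2149, 2011, 1996, 3836, 102]
second = [101, 3642, 2003, 2062, 3697, 2000, 3305, 2084, 2060, 5352, 1012, 102]

padded = pad_batch([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The second row should come out as 12 real ids followed by 10 zeros, matching the tensors printed by the tokenizer.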


In [12]:
from transformers import AutoModelForSequenceClassification
import torch
import torch.nn as nn
import torch.nn.functional as F

In [13]:
english_model = AutoModelForSequenceClassification.from_pretrained(english_model_name)

In [14]:
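with torch.no_grad():  # inference only, so gradient tracking is not needed
    outputs = english_model(**batch)
    # Passing labels as well would also return a loss:
    # outputs = english_model(**batch, labels=torch.tensor([1, 0]))
    print(outputs)
    # Softmax over the class dimension turns the logits into probabilities
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    # The index of the largest probability is the predicted class id
    labels = torch.argmax(predictions, dim=1)
    print(labels)
    # Map class ids to the label names stored in the model config
    labels_mapping = [english_model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels_mapping)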

SequenceClassifierOutput(loss=None, logits=tensor([[-3.9319,  4.1532],
        [ 3.9425, -3.1952]]), hidden_states=None, attentions=None)
tensor([[3.0798e-04, 9.9969e-01],
        [9.9921e-01, 7.9389e-04]])
tensor([1, 0])
['POSITIVE', 'NEGATIVE']
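
To make the step from logits to labels concrete, here is a small sketch that repeats the softmax and argmax by hand on the logits printed above, using only the standard library. The `id2label` dictionary is written out as an assumption inferred from the output (`1` maps to `POSITIVE`, `0` to `NEGATIVE`). Because the logits are rounded to four decimals, the probabilities agree with the tensor above only approximately.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed mapping, inferred from the printed labels above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Logits copied (rounded) from the SequenceClassifierOutput above
logits_batch = [[-3.9319, 4.1532], [3.9425, -3.1952]]

results = []
for logits in logits_batch:
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((id2label[label_id], probs))

for label, probs in results:
    print(label, [f"{p:.4e}" for p in probs])
```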


## **Chinese Model**

In [15]:
chinese_pipe = pipeline('fill-mask', model="bert-base-chinese", tokenizer="bert-base-chinese")

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# English: "Although this course is very [MASK], I still worked hard to finish it"
chinese_pipe("這門課雖然很"+chinese_pipe.tokenizer.mask_token+"但我還是努力將它完成")

[{'score': 0.48670488595962524,
  'sequence': '這 門 課 雖 然 很 難 但 我 還 是 努 力 將 它 完 成',
  'token': 7432,
  'token_str': '難'},
 {'score': 0.10295306146144867,
  'sequence': '這 門 課 雖 然 很 累 但 我 還 是 努 力 將 它 完 成',
  'token': 5168,
  'token_str': '累'},
 {'score': 0.06878484040498734,
  'sequence': '這 門 課 雖 然 很 短 但 我 還 是 努 力 將 它 完 成',
  'token': 4764,
  'token_str': '短'},
 {'score': 0.05155231058597565,
  'sequence': '這 門 課 雖 然 很 苦 但 我 還 是 努 力 將 它 完 成',
  'token': 5736,
  'token_str': '苦'},
 {'score': 0.023980475962162018,
  'sequence': '這 門 課 雖 然 很 忙 但 我 還 是 努 力 將 它 完 成',
  'token': 2564,
  'token_str': '忙'}]
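
Each fill-mask prediction is a dictionary with a `score`, the filled-in `sequence`, and the predicted `token` and `token_str`. As a rough sketch of how such a result list could be post-processed, the snippet below keeps only the candidates above a cutoff and sums the probability mass covered by the top five. The scores are copied from the output above and rounded to four decimals, the `sequence` field is omitted for brevity, and `THRESHOLD` is an arbitrary value chosen for illustration.

```python
# Predictions copied from the fill-mask output above (scores rounded)
predictions = [
    {"score": 0.4867, "token": 7432, "token_str": "難"},
    {"score": 0.1030, "token": 5168, "token_str": "累"},
    {"score": 0.0688, "token": 4764, "token_str": "短"},
    {"score": 0.0516, "token": 5736, "token_str": "苦"},
    {"score": 0.0240, "token": 2564, "token_str": "忙"},
]

THRESHOLD = 0.05  # hypothetical cutoff, not part of the original notebook

# Keep the candidate characters whose score reaches the cutoff
confident = [p["token_str"] for p in predictions if p["score"] >= THRESHOLD]

# Share of the total probability covered by the five returned candidates
covered = sum(p["score"] for p in predictions)

print(confident)
print(f"top-5 probability mass: {covered:.4f}")
```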

In [17]:
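# English: "Tonight's reservation [MASK] is fully booked; there are still seats available tomorrow afternoon"
chinese_pipe("今晚訂"+chinese_pipe.tokenizer.mask_token+"已經客滿，明天下午還有空位")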

[{'score': 0.49522095918655396,
  'sequence': '今 晚 訂 位 已 經 客 滿 ， 明 天 下 午 還 有 空 位',
  'token': 855,
  'token_str': '位'},
 {'score': 0.37189793586730957,
  'sequence': '今 晚 訂 房 已 經 客 滿 ， 明 天 下 午 還 有 空 位',
  'token': 2791,
  'token_str': '房'},
 {'score': 0.07200019806623459,
  'sequence': '今 晚 訂 單 已 經 客 滿 ， 明 天 下 午 還 有 空 位',
  'token': 1606,
  'token_str': '單'},
 {'score': 0.01261210162192583,
  'sequence': '今 晚 訂 票 已 經 客 滿 ， 明 天 下 午 還 有 空 位',
  'token': 4873,
  'token_str': '票'},
 {'score': 0.011162370443344116,
  'sequence': '今 晚 訂 座 已 經 客 滿 ， 明 天 下 午 還 有 空 位',
  'token': 2429,
  'token_str': '座'}]