## Hugging Face

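**Hugging Face** is a platform built around open-source tools and resources for natural language processing (NLP) and computer vision projects. It offers model hosting, tokenizers, machine learning applications, datasets, and educational material for training and deploying AI models.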

Key features of using **Hugging Face** for text classification include:

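1. **Pretrained models**: Hugging Face hosts a large number of pretrained models, such as BERT and GPT-2. These models have already been trained on large datasets and can be fine-tuned for text classification, or used directly when a checkpoint already fine-tuned for the task is available.

2. **Ease of use**: Hugging Face provides simple APIs and tools that let developers load models, fine-tune them, and deploy them with little code.

3. **Variety**: Many model architectures are supported, covering a range of text, image, and audio tasks.

4. **Community support**: Hugging Face has an active community in which users share models, datasets, and applications, which encourages collaboration and knowledge sharing.

5. **Flexibility**: The Transformers library is compatible with the PyTorch, TensorFlow, and JAX deep learning libraries, so it fits into different deep learning workflows.

6. **Efficiency**: With transfer learning, users fine-tune a pretrained model instead of training one from scratch, which usually shortens development time and often improves model quality.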

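In short, Hugging Face is a capable platform for researchers and developers working on NLP and computer vision projects. Its ease of use and large library of pretrained models make tasks such as text classification faster to build and often more accurate.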

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch

In [2]:
dataset = load_dataset('csv', data_files='data.csv', delimiter=',',  names=["label", "review"])
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 1243
    })
})

In [3]:
dataset = dataset['train'].train_test_split(test_size=0.3)
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 870
    })
    test: Dataset({
        features: ['label', 'review'],
        num_rows: 373
    })
})

In [4]:
dataset['train'][0]

{'label': '0', 'review': 'xx人分别享有世界众筹节发起人权益'}

In [5]:
tokenizer = AutoTokenizer.from_pretrained(r"E:\code\distilbert-base-cased")

In [6]:
def preprocess_function(examples):
    return tokenizer(examples["review"], truncation=True)
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/870 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/373 [00:00<?, ? examples/s]
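
The warning above means that `truncation=True` had no effect: the tokenizer loaded from this local directory does not know the model's maximum length, so nothing is truncated. One possible remedy, sketched below, is to pass `max_length` explicitly. The value 512 is an assumption based on the 512 position embeddings the model prints later in this notebook, and `make_preprocess` and `StubTokenizer` are hypothetical names used only to keep the sketch runnable without the model files.

```python
def make_preprocess(tokenizer, max_length=512):
    """Build a batched preprocessing function that truncates to max_length tokens."""
    def preprocess_function(examples):
        # Passing max_length explicitly makes truncation=True effective even when
        # the tokenizer itself does not know the model's maximum length.
        return tokenizer(examples["review"], truncation=True, max_length=max_length)
    return preprocess_function


class StubTokenizer:
    """Hypothetical stand-in that splits on characters; NOT the real tokenizer."""
    def __call__(self, texts, truncation=False, max_length=None):
        ids = [[ord(ch) for ch in text] for text in texts]
        if truncation and max_length is not None:
            ids = [seq[:max_length] for seq in ids]
        return {"input_ids": ids}


preprocess = make_preprocess(StubTokenizer(), max_length=4)
batch = preprocess({"review": ["abcdefgh", "xy"]})
print([len(seq) for seq in batch["input_ids"]])  # [4, 2]
```

With the real tokenizer, the call would look like `dataset.map(make_preprocess(tokenizer), batched=True)`.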

In [7]:
tokenized_dataset['train'][0]

{'label': '0',
 'review': 'xx人分别享有世界众筹节发起人权益',
 'input_ids': [101,
  22038,
  1756,
  1775,
  100,
  100,
  1873,
  1745,
  100,
  100,
  100,
  100,
  100,
  100,
  1756,
  100,
  100,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
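
Most ids in the output above are 100, which the tokenizer's special-token table (printed with the data collator further down) lists as `[UNK]`. In other words, this English vocabulary cannot represent most of the Chinese characters, so the model sees very little of the review text. The sketch below measures that share for the example above; `unk_ratio` is a hypothetical helper, and the ids 100, 101, and 102 are taken from the notebook's own output.

```python
def unk_ratio(input_ids, unk_id=100, special_ids=(101, 102)):
    """Fraction of non-special tokens that are the unknown token."""
    content = [i for i in input_ids if i not in special_ids]
    if not content:
        return 0.0
    return sum(1 for i in content if i == unk_id) / len(content)


# input_ids copied from the tokenized example above
example_ids = [101, 22038, 1756, 1775, 100, 100, 1873, 1745,
               100, 100, 100, 100, 100, 100, 1756, 100, 100, 102]
print(unk_ratio(example_ids))  # 0.625
```

A checkpoint with a Chinese or multilingual vocabulary (for example `bert-base-chinese`) would likely suit this data better, though that is a suggestion and not something this notebook tests.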

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='E:\code\distilbert-base-cased', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple
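
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time, instead of padding the whole dataset to one global length. The following is a minimal pure-Python sketch of that idea, not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id 0 comes from the `[PAD]` entry in the output above.

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padding positions get mask 0 so attention ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```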

In [9]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
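
`average='weighted'` computes each metric per class and then averages the per-class values, weighting each by its support (the number of true samples in that class). A small pure-Python sketch of weighted precision illustrates this; `weighted_precision` is a hypothetical helper written for illustration only, and in practice the scikit-learn function above does the work.

```python
from collections import Counter


def weighted_precision(labels, preds):
    """Per-class precision averaged with weights equal to class support."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        predicted = sum(1 for p in preds if p == cls)
        correct = sum(1 for l, p in zip(labels, preds) if l == p == cls)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted if predicted else 0.0
        score += precision * n_true / total
    return score


labels = [0, 0, 0, 1]
preds = [0, 0, 1, 1]
print(weighted_precision(labels, preds))  # 0.875
```

Here class 0 has precision 1.0 with support 3 and class 1 has precision 0.5 with support 1, giving 1.0 × 3/4 + 0.5 × 1/4 = 0.875.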

In [10]:
# Human-readable class names: 0 = normal (legitimate) SMS, 1 = spam SMS
id2label = {0: "normal SMS", 1: "spam SMS"}
label2id = {"normal SMS": 0, "spam SMS": 1}

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(
    r"E:\code\distilbert-base-cased", num_labels=2, id2label=id2label, label2id=label2id
)
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at E:\code\distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [12]:
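# Count the parameters that will be updated during training.
total_params = 0
for name, parameters in model.named_parameters():
    if not parameters.requires_grad:
        continue  # skip frozen parameters
    print(name, ':', parameters.size())
    total_params += parameters.numel()
print("Number of trainable parameters:", total_params)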

distilbert.embeddings.word_embeddings.weight : torch.Size([30522, 768])
distilbert.embeddings.position_embeddings.weight : torch.Size([512, 768])
distilbert.embeddings.LayerNorm.weight : torch.Size([768])
distilbert.embeddings.LayerNorm.bias : torch.Size([768])
distilbert.transformer.layer.0.attention.q_lin.weight : torch.Size([768, 768])
distilbert.transformer.layer.0.attention.q_lin.bias : torch.Size([768])
distilbert.transformer.layer.0.attention.k_lin.weight : torch.Size([768, 768])
distilbert.transformer.layer.0.attention.k_lin.bias : torch.Size([768])
distilbert.transformer.layer.0.attention.v_lin.weight : torch.Size([768, 768])
distilbert.transformer.layer.0.attention.v_lin.bias : torch.Size([768])
distilbert.transformer.layer.0.attention.out_lin.weight : torch.Size([768, 768])
distilbert.transformer.layer.0.attention.out_lin.bias : torch.Size([768])
distilbert.transformer.layer.0.sa_layer_norm.weight : torch.Size([768])
distilbert.transformer.layer.0.sa_layer_norm.bias : torch.
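
Each line above gives a parameter tensor's shape, and its contribution to the total is the product of its dimensions. As a worked check that uses only shapes printed above, the embedding block contributes the following:

```python
import math

# Shapes copied from the listing above (embedding block only).
embedding_shapes = {
    "word_embeddings.weight": (30522, 768),
    "position_embeddings.weight": (512, 768),
    "LayerNorm.weight": (768,),
    "LayerNorm.bias": (768,),
}

counts = {name: math.prod(shape) for name, shape in embedding_shapes.items()}
embedding_total = sum(counts.values())
print(counts["word_embeddings.weight"])  # 23440896
print(embedding_total)                   # 23835648
```

The word embedding table alone accounts for roughly 23.4 million parameters, which is why vocabulary size has such a large effect on model size.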

In [13]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
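    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",  # must match the evaluation strategy when load_best_model_at_end=True
    load_best_model_at_end=True,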
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
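
With these settings the length of the run can be estimated up front. A rough sketch, assuming a single device and no gradient accumulation (neither is stated in the notebook), and using the 870 training rows from the split shown earlier:

```python
import math

train_rows = 870   # size of the train split shown earlier
batch_size = 16    # per_device_train_batch_size
epochs = 3         # num_train_epochs

# The last batch of an epoch may be smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 55 165
```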

In [14]:
trainer.train()

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`label` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
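
The hint in parentheses points at the `label` column: as the earlier outputs show, each label was read from the CSV as a string (`'0'`), and the collator cannot build an integer tensor from strings. One possible fix, sketched here under the assumption that every value in the column is the string `'0'` or `'1'` (a header row read as data would break it), is to cast the labels to `int` in a batched function before training. `cast_labels` is a hypothetical helper name.

```python
def cast_labels(examples):
    """Convert string labels such as '0' / '1' to integers for a batch of rows."""
    # Assumption: every label parses as an integer; int() raises ValueError otherwise.
    return {"label": [int(label) for label in examples["label"]]}


# A batch in the dict-of-lists form that a batched map function receives.
batch = {"label": ["0", "1", "0"], "review": ["a", "b", "c"]}
print(cast_labels(batch))  # {'label': [0, 1, 0]}
```

It could be applied with `tokenized_dataset = tokenized_dataset.map(cast_labels, batched=True)` before the `Trainer` is constructed. The string `review` column needs no such treatment, because by default the `Trainer` drops columns that the model's forward method does not accept.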

In [None]:
predictions = trainer.predict(tokenized_dataset["test"])
# Predicted class id for each test example (argmax over the logits).
pred_labels = predictions.predictions.argmax(-1).tolist()
# To get label names instead: [id2label[i] for i in pred_labels]

In [None]:
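from sklearn import metrics

# The labels were loaded from the CSV as strings; cast them to int to match the
# predictions (int() leaves labels that are already integers unchanged).
true_labels = [int(label) for label in tokenized_dataset['test']['label']]
# Classification report; "support" is the number of test samples per class.
classify_report = metrics.classification_report(true_labels, pred_labels, digits=4)
print(classify_report)
# Confusion matrix: rows are true classes, columns are predicted classes.
confusion_matrix = metrics.confusion_matrix(true_labels, pred_labels)
print(confusion_matrix)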