In [2]:
import torch

model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"

# Check whether CUDA is accessible
print("CUDA is available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU device name:", torch.cuda.get_device_name(0))
    print("Number of GPUs available:", torch.cuda.device_count())
else:
    print("No GPU available, using CPU")

CUDA is available: True
GPU device name: NVIDIA GeForce RTX 4090
Number of GPUs available: 1
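
The check above only reports whether CUDA is visible. Depending on the installed `transformers` version, a pipeline may stay on the CPU unless a `device` is passed explicitly. A minimal sketch, assuming the integer convention of the `device` argument of `pipeline` (`-1` for CPU, `0` for the first GPU); `pick_device` is a hypothetical helper, not part of any library:

```python
def pick_device(cuda_available: bool, gpu_index: int = 0) -> int:
    """Return a value usable as pipeline(device=...): a GPU index, or -1 for CPU."""
    return gpu_index if cuda_available else -1

# Hypothetical usage inside this notebook (not executed here):
#   pipe = pipeline("text-classification", model=...,
#                   device=pick_device(torch.cuda.is_available()))
print(pick_device(True), pick_device(False))
```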


### xlm-roberta-base-language-detection
This model is a fine-tuned version of xlm-roberta-base on the Language Identification dataset.

#### Model description
This model is an XLM-RoBERTa transformer model with a classification head on top (i.e. a linear layer on top of the pooled output). For additional information please refer to the xlm-roberta-base model card or to the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.
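
To make "a linear layer on top of the pooled output" concrete, here is a toy sketch in plain Python. The vector, weights, and two-label set are invented for illustration; the real model uses learned weights and 20 labels. It shows how a `score` like the ones printed below can arise as a softmax probability.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled, weights, bias, labels):
    """Apply one linear layer to the pooled vector, then softmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Toy numbers (assumptions): a 3-dimensional "pooled output" and 2 labels.
labels = ["en", "zh"]
weights = [[1.0, 0.0, -1.0],
           [-1.0, 0.5, 2.0]]
bias = [0.0, 0.0]
result = classify([0.2, 0.4, 1.5], weights, bias, labels)
print(result)
```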

#### Intended uses & limitations
You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 20 languages:

Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh)

#### Model page
https://huggingface.co/papluca/xlm-roberta-base-language-detection

In [13]:
from transformers import pipeline

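# "sentiment-analysis" is an alias of "text-classification"; the generic name fits language detection better
pipe = pipeline("text-classification", model=model_path + "xlm-roberta-base-language-detection")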
pipe("今儿上海可真冷啊")

[{'label': 'zh', 'score': 0.9885185360908508}]

### Testing more examples

In [14]:
pipe("我觉得这家店蒜泥白肉的味道一般")

[{'label': 'zh', 'score': 0.9933799505233765}]

In [7]:
pipe("你学东西真的好快，理论课一讲就明白了")

[{'label': 'zh', 'score': 0.9934589266777039}]

In [8]:
pipe("You learn things really quickly. You understand the theory class as soon as it is taught.")

[{'label': 'en', 'score': 0.9935345649719238}]

In [9]:
pipe("Today Shanghai is really cold.")

[{'label': 'en', 'score': 0.962114691734314}]

### Batched inference

In [10]:
text_list = [
    "Today Shanghai is really cold.",
    "I think the taste of the garlic mashed pork in this store is average.",
    "You learn things really quickly. You understand the theory class as soon as it is taught."
]

pipe(text_list)

[{'label': 'en', 'score': 0.962114691734314},
 {'label': 'en', 'score': 0.9939622282981873},
 {'label': 'en', 'score': 0.9935345649719238}]

## Using the Pipeline API for more predefined tasks

## Natural Language Processing (NLP)

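**NLP** (natural language processing) tasks are among the most common task types, because text is a natural way for us to communicate. To get text into a format a model can recognize, it must be tokenized: a sequence of text is split into separate words or subwords (tokens), and those tokens are then converted into numbers. As a result, a sequence of text can be represented as a sequence of numbers, and once you have a sequence of numbers it can be fed to a model to solve all kinds of NLP tasks!

The text classification task demonstrated above, as well as the token classification, question answering, and other tasks that follow, all fall under NLP.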
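
The text-to-numbers step described in this section can be sketched with a toy whitespace tokenizer. This is only an illustration: the models in this notebook ship their own subword tokenizers, and `build_vocab` and `encode` are hypothetical helpers.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each whitespace-separated word to its id (0 if the word was never seen)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

vocab = build_vocab(["today shanghai is really cold", "shanghai is a big city"])
ids = encode("Shanghai is really big today", vocab)
print(ids)
```

A real subword tokenizer would split rare words into smaller pieces instead of mapping them to a single unknown id.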

### Token Classification

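In any NLP task, text is preprocessed by splitting the sequence of text into individual words or subwords. These are known as tokens.

**Token Classification** assigns each token a label from a predefined set of classes.

Two common types of token classification are:

- Named entity recognition (NER): label a token according to an entity category such as organization, person, location, or date. NER is especially popular in biomedical settings, where it can label genes, proteins, and drug names.
- Part-of-speech tagging (POS): label a token according to its part of speech, such as noun, verb, or adjective. POS is useful for helping translation systems understand how two identical words differ grammatically (for example, "bank" as a noun versus "bank" as a verb).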



### tner/roberta-large-ontonotes5
This model is a fine-tuned version of roberta-large on the tner/ontonotes5 dataset. Model fine-tuning is done via T-NER's hyper-parameter search (see the repository for more detail). It achieves the following results on the test set:

- F1 (micro): 0.908632361399938
- Precision (micro): 0.905148095909732
- Recall (micro): 0.9121435551212579
- F1 (macro): 0.8265477704565624
- Precision (macro): 0.8170668848546687
- Recall (macro): 0.8387672780349001

#### Model page
https://huggingface.co/tner/roberta-large-ontonotes5

In [3]:
from transformers import pipeline

classifier = pipeline(task="ner",model = model_path + "roberta-large-ontonotes5")

In [4]:
preds = classifier("Hugging Face is a French company based in New York City.")
preds = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
print(*preds, sep="\n")

{'entity': 'B-ORG', 'score': 0.9999, 'index': 1, 'word': 'Hug', 'start': 0, 'end': 3}
{'entity': 'I-ORG', 'score': 1.0, 'index': 2, 'word': 'ging', 'start': 3, 'end': 7}
{'entity': 'I-ORG', 'score': 1.0, 'index': 3, 'word': 'ĠFace', 'start': 8, 'end': 12}
{'entity': 'B-NORP', 'score': 0.9999, 'index': 6, 'word': 'ĠFrench', 'start': 18, 'end': 24}
{'entity': 'B-GPE', 'score': 0.9994, 'index': 10, 'word': 'ĠNew', 'start': 42, 'end': 45}
{'entity': 'I-GPE', 'score': 0.9996, 'index': 11, 'word': 'ĠYork', 'start': 46, 'end': 50}
{'entity': 'I-GPE', 'score': 0.9994, 'index': 12, 'word': 'ĠCity', 'start': 51, 'end': 55}


#### Merging entities

In [6]:
classifier = pipeline(task="ner",model =  model_path +"roberta-large-ontonotes5", aggregation_strategy="simple")
classifier("Hugging Face is a French company based in New York City.")

[{'entity_group': 'ORG',
  'score': 0.9999604,
  'word': 'Hugging Face',
  'start': 0,
  'end': 12},
 {'entity_group': 'NORP',
  'score': 0.99992514,
  'word': ' French',
  'start': 18,
  'end': 24},
 {'entity_group': 'GPE',
  'score': 0.9994443,
  'word': ' New York City',
  'start': 42,
  'end': 55}]
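
To illustrate what merging does, here is a rough sketch that groups the per-token predictions printed earlier into entities by their `B-`/`I-` prefixes. It is not the library's implementation: `merge_entities` is a hypothetical helper, averaging the token scores is an assumption, and the `word` is sliced from the original text (so it has no leading space, unlike the pipeline output above).

```python
def merge_entities(text, token_preds):
    """Group consecutive B-/I- tagged tokens of the same type into one entity."""
    groups = []
    for tok in token_preds:
        prefix, _, label = tok["entity"].partition("-")  # "B-ORG" -> ("B", "-", "ORG")
        if groups and prefix == "I" and groups[-1]["entity_group"] == label:
            groups[-1]["end"] = tok["end"]               # extend the current entity
            groups[-1]["scores"].append(tok["score"])
        else:
            groups.append({"entity_group": label, "start": tok["start"],
                           "end": tok["end"], "scores": [tok["score"]]})
    return [{"entity_group": g["entity_group"],
             "score": sum(g["scores"]) / len(g["scores"]),  # assumption: mean score
             "word": text[g["start"]:g["end"]],
             "start": g["start"], "end": g["end"]} for g in groups]

text = "Hugging Face is a French company based in New York City."
token_preds = [
    {"entity": "B-ORG", "score": 0.9999, "start": 0, "end": 3},
    {"entity": "I-ORG", "score": 1.0, "start": 3, "end": 7},
    {"entity": "I-ORG", "score": 1.0, "start": 8, "end": 12},
    {"entity": "B-NORP", "score": 0.9999, "start": 18, "end": 24},
    {"entity": "B-GPE", "score": 0.9994, "start": 42, "end": 45},
    {"entity": "I-GPE", "score": 0.9996, "start": 46, "end": 50},
    {"entity": "I-GPE", "score": 0.9994, "start": 51, "end": 55},
]
merged = merge_entities(text, token_preds)
for entity in merged:
    print(entity["entity_group"], repr(entity["word"]), entity["start"], entity["end"])
```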

### Question Answering

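**Question Answering** is another token-level task that returns an answer to a question, sometimes with context (open-domain) and sometimes without context (closed-domain). This task happens whenever we ask a virtual assistant something, such as whether a restaurant is open. It can also provide customer or technical support and help search engines retrieve the relevant information you are asking for.

There are two common types of question answering:

- Extractive: given a question and some context, the model must extract a span of text from the context as the answer
- Abstractive: given a question and some context, the answer is generated from the context; this approach is handled by `Text2TextGenerationPipeline` instead of the `QuestionAnsweringPipeline` shown below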
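
Extractive models typically score every token as a possible answer start and as a possible answer end, and the answer is the span with the best combined score. The sketch below shows that selection step on invented logits; `best_span` and its `max_answer_len` parameter are hypothetical names, and a real pipeline does additional work such as mapping tokens back to character offsets.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return ((start, end), score) maximizing start_logits[i] + end_logits[j] with i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy context for the question "Who proposed Whisper?" (logits are made up).
tokens = ["Whisper", "was", "proposed", "by", "Alec", "Radford", "et", "al"]
start_logits = [0.1, 0.0, 0.2, 0.3, 3.0, 1.0, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.1, 0.0, 0.5, 3.5, 0.8, 0.4]

span, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[span[0]:span[1] + 1])
print(span, answer)
```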

This model can be used for extractive QA. It has been fine-tuned for 3 epochs on SQuAD2.0.

#### Model page
https://huggingface.co/timpal0l/mdeberta-v3-base-squad2

In [7]:
from transformers import pipeline

question_answerer = pipeline(task="question-answering",model = model_path + "mdeberta-v3-base-squad2")

In [8]:
preds = question_answerer(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

score: 0.9839, start: 29, end: 54, answer:  huggingface/transformers


In [9]:
preds = question_answerer(
    question="What is the capital of China?",
    context="On 1 October 1949, CCP Chairman Mao Zedong formally proclaimed the People's Republic of China in Tiananmen Square, Beijing.",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

score: 0.9296, start: 114, end: 123, answer:  Beijing.


### Summarization

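**Summarization** creates a shorter version of a longer text while trying to preserve most of the meaning of the original document. Summarization is a sequence-to-sequence task; it outputs a shorter text sequence than the input. Many long-form documents can be summarized to help readers quickly grasp the main points. Legislative bills, legal and financial documents, patents, and scientific papers are examples of documents that could be summarized to save readers time and serve as a reading aid.

Like question answering, there are two types of summarization:

- Extractive: identify and extract the most important sentences from the original text
- Abstractive: generate the target summary (which may include new words not in the input document) from the original text; `SummarizationPipeline` uses the abstractive approach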
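
For contrast with the abstractive pipeline used below, here is a minimal sketch of the extractive idea: score each sentence by how frequent its words are in the whole text and keep the top ones. The scoring rule and the helper name `extractive_summary` are assumptions chosen for brevity, not a method used by any model in this notebook.

```python
import re
from collections import Counter

def extractive_summary(sentences, k=1):
    """Pick the k sentences whose words are, on average, most frequent in the text."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    # Score = average corpus frequency of the words in the sentence.
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep the original order

sentences = [
    "Transformers process sequences in parallel.",
    "The weather was nice.",
    "Parallel processing lets transformers train faster on GPUs.",
]
print(extractive_summary(sentences, k=1))
```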

#### T5 model for multilingual text summary in English, Russian and Chinese
This model is designed to perform controlled generation of summary text in multitasking mode, with a built-in translation function for Russian, Chinese, and English.

This is a multitasking T5 model with a conditionally controlled ability to generate summary text and to translate it. In total, it understands 12 commands, selected by the prefix:

- "summary: " - to generate simple concise content in the source language
- "summary brief: " - to generate a shortened summary in the source language
- "summary big: " - to generate an elongated summary in the source language

The model can understand text in any language from the list: Russian, Chinese or English. It can also translate the result into any language from the list: Russian, Chinese or English.

For translation into the target language, the target language identifier is specified as a prefix `"... to <lang>:"`, where lang can take the values ru, en, zh. The source language need not be specified, and the source text may be multilingual.
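
The cells below pass raw text to the summarizer without any of these prefixes. A small sketch of how the three documented summary prefixes could be prepended follows; `build_prompt` is a hypothetical helper, collapsing whitespace is an added convenience for triple-quoted inputs, and whether a prefix improves the outputs shown in this notebook has not been checked.

```python
PREFIXES = {
    "default": "summary: ",
    "brief": "summary brief: ",
    "big": "summary big: ",
}

def build_prompt(text: str, mode: str = "default") -> str:
    """Prepend one of the documented command prefixes to the input text."""
    if mode not in PREFIXES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(PREFIXES)}")
    # Collapse newlines and indentation left over from triple-quoted strings.
    return PREFIXES[mode] + " ".join(text.split())

prompt = build_prompt("""
    Large language models (LLM) are very large deep learning models
    that are pre-trained on vast amounts of data.
""", mode="brief")
print(prompt)
```

The resulting string would be passed to `summarizer(...)` in place of the raw text.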

#### Model page
https://huggingface.co/utrobinmv/t5_summary_en_ru_zh_base_2048

In [15]:
from transformers import pipeline

summarizer = pipeline(task="summarization",
                      model=model_path+"t5_summary_en_ru_zh_base_2048",
                      min_length=8,
                      max_length=32
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
summarizer(
    """
    In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, 
    replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. 
    For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. 
    On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. 
    In the former task our best model outperforms even all previously reported ensembles.
    """
)


[{'summary_text': "In this week's Scrubbing Up, WMT 2014 English-to-German and World Trade Organization (WMT) "}]

In [17]:
summarizer(
    '''
    Large language models (LLM) are very large deep learning models that are pre-trained on vast amounts of data. 
    The underlying transformer is a set of neural networks that consist of an encoder and a decoder with self-attention capabilities. 
    The encoder and decoder extract meanings from a sequence of text and understand the relationships between words and phrases in it.
    Transformer LLMs are capable of unsupervised training, although a more precise explanation is that transformers perform self-learning. 
    It is through this process that transformers learn to understand basic grammar, languages, and knowledge.
    Unlike earlier recurrent neural networks (RNN) that sequentially process inputs, transformers process entire sequences in parallel. 
    This allows the data scientists to use GPUs for training transformer-based LLMs, significantly reducing the training time.
    '''
)


[{'summary_text': 'Understand the role of transformer-based LLMs. Know the difference between transformers and recurrent neural networks.'}]

In [20]:
summarizer(
    '''
    姜维（202年—264年3月3日）[1]，字伯约，凉州天水郡冀县（今甘肃省天水市甘谷县）人。三国时期蜀汉著名军事家。原为曹魏天水郡中郎将，后降蜀汉，深受诸葛亮器重。蒋琬、费袆先后逝世后，姜维总领蜀汉军权，并先后十一次伐魏。其后，司马昭灭蜀汉，姜维在剑阁防守锺会。邓艾出奇兵从阴平小路历经艰辛，突然出现在成都附近，诸葛亮子诸葛瞻战死，后主刘禅降魏，蜀汉灭亡[2]:9。姜维打算利用锺会的野心复国，遂降于锺会并共同发动叛乱，但因事败死于乱军之中，享年六十二岁。
生平
凉州异才
良田百顷，不在一亩；但有远志，不在当归。
“”
姜维与母亲书, 《三国志·姜维传》注引《杂记》
姜维出生于202年，父亲姜冏是天水郡守的佐官，曾任郡功曹，早年于羌、戎叛乱中，战死沙场。姜维与母亲相依为命，喜欢汉朝学者郑玄学说。时常结交一些豪杰，暗中养了些死士，心中有大志。初为曹魏中郎[3]，参天水郡军事[4]:1062。建兴六年（228年）诸葛亮出兵祁山，姜维及功曹梁绪、主簿尹赏、主记梁虔等正与天水太守马遵、雍州刺史郭淮外出巡视，马遵听到蜀军将至且诸县响应，怀疑其麾下的姜维等人皆有异心，遂趁夜逃到上邽。当姜维等人察觉马遵已逃走，想回去，但城门已关闭。去冀城，也被拒门外，遂投降诸葛亮，姜维母亲则滞留魏国[5]。
颐和园长廊上的“收姜维”情节
诸葛亮征辟姜维为仓曹掾，加奉义将军，封当阳亭侯，时年27岁。诸葛亮曾与张裔、蒋琬书称：“姜维忠勤时事，思虑精密，考察他所拥有之才能，李邵、马良都比不上。此人，乃凉州之上等人才。”(有一说为凉州最杰出之人）又说：“姜维在军事上很有见解，既有胆色、明义理，深解兵法意理。此人心存汉室，才能兼备于人，须先教他操练中虎步兵五六千人，将军事全教给他，当带他进宫，觐见天子。”后来，姜维迁为中监军、征西将军。
军旅生涯
234年，诸葛亮死于五丈原后，姜维返回成都，为右监军、辅汉将军，统率诸军，进封平襄县侯。238年，随大将军蒋琬（诸葛亮后继者）迁往汉中。蒋琬不久升为大司马，便以姜维为司马，数次率偏军西入。243年，升为镇西大将军，领凉州刺史。247年，升卫将军，与大将军费祎共同行使尚书事权。是年，汶山平康蛮人反叛，姜维率众讨伐平定。出在陇西、南安、金城郡边界，与魏国前将军郭淮、右将军夏侯霸等于洮西交战。249年，刘禅授姜维假节，姜维出兵西平。姜维每次想大举出兵，费祎常不依从，限制给他不超过一万名士兵，因此没有重大斩获[4]:1064。
北伐中原
主条目：姜维北伐

'''
)

[{'summary_text': '姜维在军事上很有见解,既有胆色、明义理,深解兵法意理。此人心存汉室'}]


## Audio

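Audio and speech processing tasks differ a little from the other modalities, mainly because audio as an input is a continuous signal. Unlike text, a raw audio waveform can't be neatly split into discrete chunks the way a sentence can be divided into words. To get around this, the raw audio signal is typically sampled at regular intervals. Taking more samples within each interval gives a higher sampling rate, and the audio more closely resembles the original audio source.

Previous approaches preprocessed the audio to extract useful features from it. It is now more common to feed the raw audio waveform directly to a feature encoder to extract an audio representation. This simplifies the preprocessing step and allows the model to learn the most essential features.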
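
The effect of the sampling rate can be seen on a synthetic tone. This sketch samples the same 440 Hz sine wave at two rates; the tone, the duration, and the helper `sample_sine` are illustrative assumptions.

```python
import math

def sample_sine(freq_hz, duration_s, sample_rate):
    """Sample sin(2*pi*f*t) at t = 0, 1/sr, 2/sr, ... for the given duration."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

low = sample_sine(440, 0.01, 8000)    # 10 ms at 8 kHz
high = sample_sine(440, 0.01, 16000)  # the same 10 ms at 16 kHz
print(len(low), len(high))
```

Doubling the sampling rate doubles the number of values describing the same stretch of audio, which is why a model's expected sampling rate has to match the input file.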

### Audio classification

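**Audio classification** is a task that labels audio data from a predefined set of classes. It is a broad category with many specific applications, some of which include:

- Acoustic scene classification: label audio with a scene label ("office", "beach", "stadium")
- Acoustic event detection: label audio with a sound event label ("car horn", "whale calling", "glass breaking")
- Tagging: label audio containing multiple sounds (birdsongs, speaker identification in a meeting)
- Music classification: label music with a genre label ("metal", "hip-hop", "country")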

#### Audio Spectrogram Transformer (fine-tuned on AudioSet)
Audio Spectrogram Transformer (AST) model fine-tuned on AudioSet. It was introduced in the paper AST: Audio Spectrogram Transformer by Gong et al. and first released in this repository.

Disclaimer: The team releasing Audio Spectrogram Transformer did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
The Audio Spectrogram Transformer is equivalent to ViT, but applied on audio. Audio is first turned into an image (as a spectrogram), after which a Vision Transformer is applied. The model gets state-of-the-art results on several audio classification benchmarks.
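
To show how audio becomes an image-like array, here is a minimal magnitude spectrogram built with a naive DFT in plain Python. It is a simplification: AST's actual features are log-mel filterbank energies, no window function is applied here, and the frame sizes are arbitrary choices for the example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len, hop):
    """Slide a frame over the signal; each row holds the spectrum of one frame."""
    return [dft_magnitudes(signal[s:s + frame_len])
            for s in range(0, len(signal) - frame_len + 1, hop)]

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)

# The strongest bin of the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
print(len(spec), len(spec[0]), peak_bin * sample_rate / 64)
```

Each row is one time step and each column one frequency bin, so the result can be treated as a 2D image and cut into patches like any other picture.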

#### Usage
You can use the raw model for classifying audio into one of the AudioSet classes. See the documentation for more info.

#### Model page
https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593

Dataset page: https://huggingface.co/datasets/superb#er

```
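Emotion Recognition (ER) predicts an emotion class for each utterance. The most widely used ER dataset, IEMOCAP, is adopted, and the conventional evaluation protocol is followed: the unbalanced emotion classes are dropped, leaving the final four classes with a similar amount of data points, and evaluation is cross-validated on five folds of the standard splits. The evaluation metric is accuracy (ACC).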
```

#### Installing prerequisite packages

Installing the required audio processing package, ffmpeg, from the command line is recommended:

```shell
$ apt update && apt upgrade
$ apt install -y ffmpeg
$ pip install ffmpeg ffmpeg-python
```

In [25]:
from transformers import pipeline

classifier = pipeline(task="audio-classification", model=model_path+"ast-finetuned-audioset-10-10-0.4593",num_mel_filters=64)

In [26]:
# Use a test file hosted on Hugging Face Datasets
preds = classifier("https://hf-mirror.com/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

[{'score': 0.4208, 'label': 'Speech'},
 {'score': 0.1793, 'label': 'Rain on surface'},
 {'score': 0.1301, 'label': 'Rain'},
 {'score': 0.096, 'label': 'Raindrop'},
 {'score': 0.0578, 'label': 'Music'}]

In [27]:
# Test with a local audio file
preds = classifier("data/audio/mlk.flac")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

[{'score': 0.4208, 'label': 'Speech'},
 {'score': 0.1793, 'label': 'Rain on surface'},
 {'score': 0.1301, 'label': 'Rain'},
 {'score': 0.096, 'label': 'Raindrop'},
 {'score': 0.0578, 'label': 'Music'}]

### Automatic speech recognition (ASR)

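**Automatic speech recognition** transcribes speech into text. It is one of the most common audio tasks, partly because speech is such a natural form of human communication. Today, ASR systems are embedded in "smart" technology products such as speakers, phones, and cars. We can ask our virtual assistants to play music, set reminders, and tell us the weather.

One of the key challenges Transformer architectures have helped with, however, is low-resource languages. By pretraining on large amounts of speech data, fine-tuning the model on only one hour of labeled speech data in a low-resource language can still produce high-quality results compared with previous ASR systems trained on 100x more labeled data.
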
### openai/whisper-base
#### Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al from OpenAI. The original code repository can be found here.

Disclaimer: Content for this model card has partly been written by the Hugging Face team, and parts of it were copied and pasted from the original model card.

#### Model details
Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions to a different language to the audio.

#### Model page
https://huggingface.co/openai/whisper-base

The following example uses the `OpenAI Whisper Base` model to perform ASR through the Pipeline API:

In [3]:
from transformers import pipeline

# Specify the model with the `model` argument
transcriber = pipeline(task="automatic-speech-recognition", model=model_path+"whisper-base")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
text = transcriber("data/audio/mlk.flac")
text

{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}

## Computer Vision

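One of the first successful **Computer Vision** tasks was recognizing images of zip code digits using a convolutional neural network (CNN). An image is composed of pixels, and each pixel has a numerical value, which makes it easy to represent an image as a matrix of pixel values. Each particular combination of pixel values describes the colors of an image.

Computer vision tasks can be solved in two general ways:

- Use convolutions to learn the hierarchical features of an image, from low-level features to high-level abstract ones.
- Split an image into patches and use a Transformer to gradually learn how each image patch relates to the others to form an image. Unlike the bottom-up approach favored by a CNN, this is a bit like starting with a blurry image and gradually bringing it into focus.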

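### Image Classification

**Image Classification** labels an entire image from a predefined set of classes. Like most classification tasks, image classification has many practical use cases, some of which include:

- Healthcare: label medical images to detect disease or monitor patient health
- Environment: label satellite images to monitor deforestation, inform wildland management, or detect wildfires
- Agriculture: label images of crops to monitor plant health, or satellite images for land use monitoring
- Ecology: label images of animal or plant species to monitor wildlife populations or track endangered species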

### google/vit-base-patch16-224
#### Vision Transformer (base-sized model)
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.

Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
#### Model page
https://huggingface.co/google/vit-base-patch16-224


### google/vit-base-patch16-384
#### Vision Transformer (base-sized model)
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.

Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
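
The patch arithmetic is easy to check by hand: with 16x16 patches, a 224x224 image yields 14 x 14 = 196 patches and a 384x384 image yields 24 x 24 = 576, plus one [CLS] token in each case. The sketch below computes those counts and splits a tiny single-channel toy image into patches; both helpers are hypothetical and assume the image size is divisible by the patch size.

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Sequence length seen by the encoder: one token per patch plus [CLS]."""
    per_side = image_size // patch_size
    return per_side * per_side + 1

def split_into_patches(image, patch_size):
    """Split an H x W list-of-rows image into flattened, row-major patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

print(num_tokens(224), num_tokens(384))

# A 4x4 toy "image" cut into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
print(patches)
```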

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.

#### Intended uses & limitations
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.

In [13]:
from transformers import pipeline
classifier = pipeline(task="image-classification",model=model_path+"vit-base-patch16-384")

In [14]:
preds = classifier(
    "https://hf-mirror.com/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")

{'score': 0.2729, 'label': 'lynx, catamount'}
{'score': 0.0445, 'label': 'Egyptian cat'}
{'score': 0.0424, 'label': 'tabby, tabby cat'}
{'score': 0.0417, 'label': 'snow leopard, ounce, Panthera uncia'}
{'score': 0.0345, 'label': 'tiger cat'}


![](data/image/cat-chonk.jpeg)

In [15]:
# Use a local image (cat)
preds = classifier(
    "data/image/cat-chonk.jpeg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")

{'score': 0.2729, 'label': 'lynx, catamount'}
{'score': 0.0445, 'label': 'Egyptian cat'}
{'score': 0.0424, 'label': 'tabby, tabby cat'}
{'score': 0.0417, 'label': 'snow leopard, ounce, Panthera uncia'}
{'score': 0.0345, 'label': 'tiger cat'}


![](data/image/panda.jpg)

In [16]:
# Use a local image (panda)
preds = classifier(
    "data/image/panda.jpg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")

{'score': 0.9936, 'label': 'giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca'}
{'score': 0.0015, 'label': 'lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens'}
{'score': 0.0005, 'label': 'ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus'}
{'score': 0.0004, 'label': 'sloth bear, Melursus ursinus, Ursus ursinus'}
{'score': 0.0002, 'label': 'American black bear, black bear, Ursus americanus, Euarctos americanus'}


### Object Detection

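Unlike image classification, object detection identifies multiple objects within an image and the objects' positions in the image (defined by bounding boxes). Some example applications of object detection include:

- Self-driving vehicles: detect everyday traffic objects such as other vehicles, pedestrians, and traffic lights
- Remote sensing: disaster monitoring, urban planning, and weather forecasting
- Defect detection: detect cracks or structural damage in buildings, and manufacturing defects

Model page: https://huggingface.co/facebook/detr-resnet-50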

### facebook/detr-resnet-101
#### DETR (End-to-End Object Detection) model with ResNet-101 backbone
DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper End-to-End Object Detection with Transformers by Carion et al. and first released in this repository.

Disclaimer: The team releasing DETR did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and a MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100.

The model is trained using a "bipartite matching loss": one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.
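
The one-to-one mapping can be illustrated on a tiny cost matrix. The sketch below finds the cheapest assignment by brute force over all permutations, which gives the same optimum as the Hungarian algorithm for small N but does not scale; the cost values are invented, whereas in DETR they combine class and box terms.

```python
from itertools import permutations

def optimal_matching(cost):
    """Return (assignment, total) where assignment[q] is the annotation matched to query q."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):          # every one-to-one mapping
        total = sum(cost[q][perm[q]] for q in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# cost[q][a]: made-up cost of matching query q to annotation a (lower is better).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.6],
    [0.7, 0.4, 0.3],
]
assignment, total = optimal_matching(cost)
print(assignment, round(total, 2))
```

For N = 100 queries an exhaustive search is infeasible, which is why a polynomial-time assignment algorithm is used in practice.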
#### Model page
https://huggingface.co/facebook/detr-resnet-101

### hustvl/yolos-tiny
#### YOLOS (tiny-sized) model
YOLOS model fine-tuned on COCO 2017 object detection (118k annotated images). It was introduced in the paper You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection by Fang et al. and first released in this repository.

Disclaimer: The team releasing YOLOS did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN).

The model is trained using a "bipartite matching loss": one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.

#### Intended uses & limitations
You can use the raw model for object detection. See the model hub to look for all available YOLOS models.

In [3]:
from transformers import pipeline
import os

detector = pipeline(task="object-detection",model= model_path + "yolos-tiny",local_files_only=True)

In [10]:
preds = detector(
    "https://pic.rmb.bdstatic.com/bjh/news/8a7aa17a4d5bd3ed4db4fb1182e8b350.png@s_0,w_1242"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds]
preds

[{'score': 0.9895,
  'label': 'dog',
  'box': {'xmin': 345, 'ymin': 10, 'xmax': 995, 'ymax': 1140}}]

![](data/image/cat_dog.jpg)

In [5]:
preds = detector(
    "data/image/cat_dog.jpg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds]
preds

[{'score': 0.9946,
  'label': 'cat',
  'box': {'xmin': 75, 'ymin': 60, 'xmax': 290, 'ymax': 369}},
 {'score': 0.9899,
  'label': 'dog',
  'box': {'xmin': 280, 'ymin': 18, 'xmax': 479, 'ymax': 416}}]
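
The `box` dictionaries can be used directly for geometric checks. As a small example, this sketch computes the intersection over union (IoU) of the cat and dog boxes returned above; `box_area` and `iou` are hypothetical helpers written for this illustration.

```python
def box_area(box):
    """Area of a box given as {'xmin', 'ymin', 'xmax', 'ymax'}; 0 if degenerate."""
    return max(0, box["xmax"] - box["xmin"]) * max(0, box["ymax"] - box["ymin"])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = box_area({
        "xmin": max(a["xmin"], b["xmin"]),
        "ymin": max(a["ymin"], b["ymin"]),
        "xmax": min(a["xmax"], b["xmax"]),
        "ymax": min(a["ymax"], b["ymax"]),
    })
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

cat = {"xmin": 75, "ymin": 60, "xmax": 290, "ymax": 369}
dog = {"xmin": 280, "ymin": 18, "xmax": 479, "ymax": 416}
print(round(iou(cat, dog), 4))
```

A small IoU like this one indicates that the two detections barely overlap, consistent with two separate animals side by side.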