- [Hugging Face](https://huggingface.co/)
- [Spaces - Hugging Face](https://huggingface.co/spaces)
- [HF-Mirror](https://hf-mirror.com/)
- [HF LLM Course](https://huggingface.co/learn/llm-course/chapter1/1)

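Pipelines are a powerful and efficient way to handle many NLP tasks. Several independent NLP models can be chained into one end-to-end processing flow to tackle more complex natural language understanding and generation problems.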

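Pipelines can be combined to cover task combinations such as the following:

1. Text classification and named entity recognition (NER):

> Pipeline: text classifier -> NER model

Use case: analyze the sentiment of a user review and identify the entities it mentions, such as product names and company names.

Example flow:
- Input a user review: "I really like this Apple phone; its battery life is great!"
- The text classifier labels the sentiment as "positive".
- The NER model identifies "Apple phone" as a product entity and "battery life" as an attribute.
- Output: sentiment: positive; entities: [{'entity': 'product', 'value': 'Apple phone'}, {'entity': 'attribute', 'value': 'battery life'}]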

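2. Text summarization and question answering:

> Pipeline: summarization model -> question-answering model

Use case: quickly grasp the content of a long article and ask questions about its key information.

Example flow:
- Input a news report.
- The summarization model produces a concise summary of the report.
- Ask a question about the summary: "Which country does the report mention?"
- The question-answering model answers from the summary: "The report mentions the United States."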
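The summarize-then-ask flow above can be sketched as a small function. This is a minimal sketch under stated assumptions, not a definitive implementation: `summarize_then_answer`, `stub_summarizer`, and `stub_answerer` are hypothetical names introduced here, and the stubs only mimic the output shapes of the Transformers `summarization` and `question-answering` pipelines so that the example runs without downloading any model.

```python
def summarize_then_answer(article, question, summarizer, answerer):
    """Summarize `article`, then answer `question` using only the summary."""
    # A summarization pipeline returns a list with one dict per input.
    summary = summarizer(article)[0]["summary_text"]
    # A question-answering pipeline takes a question plus a context string.
    answer = answerer(question=question, context=summary)["answer"]
    return {"summary": summary, "answer": answer}


# Hypothetical stand-ins that mimic the pipelines' output shapes.
def stub_summarizer(text):
    # Crude "summary": keep only the first sentence.
    first_sentence = text.split(". ")[0].rstrip(".") + "."
    return [{"summary_text": first_sentence}]


def stub_answerer(question, context):
    # Naive rule: return the first known country name found in the context.
    for country in ("United States", "China", "France"):
        if country in context:
            return {"answer": country, "score": 1.0}
    return {"answer": "", "score": 0.0}


article = "The United States announced a new trade policy. Markets reacted calmly."
result = summarize_then_answer(
    article, "Which country does the report mention?", stub_summarizer, stub_answerer
)
print(result)
```

With real models, the stubs would be replaced by objects returned from `pipeline("summarization", ...)` and `pipeline("question-answering", ...)`. One design consequence is worth keeping in mind: the second stage can only answer from what survives in the summary, so details dropped by the summarizer are no longer available.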
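
3. Machine translation and text generation:

> Pipeline: machine translation model -> text generation model

Use case: translate text from one language into another, then generate further text from the translation, such as a reply or a summary.

Example flow:
- Input a Chinese sentence: "今天天气真好。"
- The machine translation model translates it into English: "The weather is really nice today."
- The text generation model produces a related reply based on the translation: "Let's go for a walk!"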
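The translate-then-generate flow can be sketched in the same way. This is an illustrative sketch: `translate_then_reply` and both stubs are hypothetical names, and the prompt wording is an arbitrary choice made here. The stubs mimic the `translation_text` and `generated_text` keys that the Transformers translation and text-generation pipelines return.

```python
def translate_then_reply(text, translator, generator):
    """Translate `text`, then generate a reply to the translation."""
    # Translation pipelines return [{"translation_text": ...}].
    english = translator(text)[0]["translation_text"]
    prompt = f'A friend says: "{english}"\nA friendly reply:'
    # Text-generation pipelines return [{"generated_text": ...}].
    generated = generator(prompt)[0]["generated_text"]
    # The generated text usually starts with the prompt itself; strip it if so.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return {"translation": english, "reply": generated.strip()}


# Hypothetical stand-ins for the two pipelines.
def stub_translator(text):
    lookup = {"今天天气真好。": "The weather is really nice today."}
    return [{"translation_text": lookup.get(text, text)}]


def stub_generator(prompt):
    return [{"generated_text": prompt + " Let's go for a walk!"}]


result = translate_then_reply("今天天气真好。", stub_translator, stub_generator)
print(result["translation"])  # The weather is really nice today.
print(result["reply"])        # Let's go for a walk!
```

Wrapping the translation in a short prompt, rather than passing it to the generator unchanged, is one way to steer a general-purpose model toward producing a reply instead of simply continuing the sentence.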
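
4. Sentiment analysis, text classification, and relation extraction:

> Pipeline: sentiment analysis model -> text classification model -> relation extraction model

Use case: determine the sentiment of a text, assign it to a specific topic, and identify the relations between the entities it mentions.

Example flow:
- Input a piece of customer feedback: "I am very dissatisfied with the service at this restaurant, and the food was terrible. I will never come back!"
- The sentiment analysis model labels the sentiment as "negative".
- The text classification model assigns it to the category "restaurant service complaint".
- The relation extraction model identifies a "dissatisfied" relation between "customer" and "restaurant", and an "unpalatable" relation between "customer" and "dishes".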
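A three-stage flow like this one might be wired together as follows. Treat it as a sketch under assumptions: all function names are hypothetical, the topic stage assumes the call and output shape of a zero-shot classification pipeline (`candidate_labels` in, a `labels` list sorted by score out), and the relation extractor is a hand-written placeholder, since relation extraction is not, as far as I know, one of the built-in `pipeline` task names and would need a dedicated model.

```python
def analyze_feedback(text, sentiment, topic_classifier, relation_extractor, topics):
    """Run sentiment, topic, and relation stages on one piece of feedback."""
    sentiment_label = sentiment(text)[0]["label"]
    # Zero-shot classifiers return labels sorted by descending score.
    topic = topic_classifier(text, candidate_labels=topics)["labels"][0]
    relations = relation_extractor(text)
    return {"sentiment": sentiment_label, "topic": topic, "relations": relations}


# Hypothetical stand-ins so the sketch runs without any model.
def stub_sentiment(text):
    cues = ("dissatisfied", "terrible", "never")
    label = "NEGATIVE" if any(cue in text for cue in cues) else "POSITIVE"
    return [{"label": label, "score": 1.0}]


def stub_topic_classifier(text, candidate_labels):
    lowered = text.lower()

    def overlap(label):
        # Count how many words of the label occur in the text.
        return sum(word in lowered for word in label.lower().split())

    ranked = sorted(candidate_labels, key=overlap, reverse=True)
    return {"sequence": text, "labels": ranked}


def stub_relation_extractor(text):
    relations = []
    if "dissatisfied" in text:
        relations.append(("customer", "dissatisfied with", "restaurant"))
    if "terrible" in text:
        relations.append(("customer", "finds unpalatable", "dishes"))
    return relations


feedback = (
    "I am very dissatisfied with the service at this restaurant, "
    "and the food was terrible. I will never come back!"
)
topics = ["delivery delay", "restaurant service complaint", "billing issue"]
report = analyze_feedback(
    feedback, stub_sentiment, stub_topic_classifier, stub_relation_extractor, topics
)
print(report)
```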
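
Common tools and libraries for building pipelines:

- Hugging Face Transformers: provides the `pipeline()` function, which makes it easy to load pretrained models for a wide range of NLP tasks; the resulting pipeline objects can then be combined. You only specify the task type and the pretrained model name, and the pipeline handles tokenization, model loading, inference, and related steps.
- spaCy: a spaCy pipeline is mainly used to build the processing flow of a single NLP model (for example tokenization, part-of-speech tagging, NER, and dependency parsing), but custom components let you combine several independent spaCy models or custom logic into a more complex pipeline.
- NLTK: provides building blocks for NLP tasks that you can combine into a pipeline by hand.
- Custom Python code: depending on your needs, you can build independent models with any library (such as scikit-learn, TensorFlow, or PyTorch) and chain them together with your own Python functions.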
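For the custom-code route, the chaining itself needs very little machinery. The sketch below is one possible shape, not a standard API: `run_chain` and the toy stages are hypothetical names. Each stage receives the accumulated state, so a later stage can build on the output of an earlier one.

```python
def run_chain(text, stages):
    """Run (name, function) stages in order, sharing one state dict."""
    state = {"text": text}
    for name, stage in stages:
        # Each stage sees everything produced so far and adds one entry.
        state[name] = stage(state)
    return state


# Toy stages standing in for real models.
stages = [
    ("tokens", lambda state: state["text"].split()),
    ("length", lambda state: len(state["tokens"])),
    ("is_long", lambda state: state["length"] > 10),
]

state = run_chain("Pipelines chain simple steps", stages)
print(state["tokens"])   # ['Pipelines', 'chain', 'simple', 'steps']
print(state["length"])   # 4
print(state["is_long"])  # False
```

Passing the whole state rather than only the previous output keeps the original text available to every stage, which matters when, say, both a classifier and an entity recognizer need the raw input.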

Examples using the Hugging Face Transformers `pipeline`. Commonly used models, by task:

- Text classification: finbert, roberta-base-go_emotions, twitter-roberta-base-sentiment-latest
- Question answering: roberta-base-squad2, xlm-roberta-large-squad2, distilbert-base-cased-distilled-squad
- Zero-shot classification: bart-large-mnli, mDeBERTa-v3-base-mnli-xnli
- Translation: t5-base, opus-mt-zh-en, translation_en-zh
- Summarization: bart-large-cnn, led-base-book-summary
- Text generation: Baichuan-13B-Chat, falcon-40b, starcoder
- Text similarity: all-MiniLM-L6-v2, text2vec-large-chinese, all-mpnet-base-v2

In [2]:
from transformers import pipeline

# Sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
result_sentiment = sentiment_pipeline("I love using the transformers library!")
print(f"Sentiment analysis result: {result_sentiment}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0

Sentiment analysis result: [{'label': 'POSITIVE', 'score': 0.9993904829025269}]
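The log above warns against relying on the default model. One way to address it is to pass the model name and revision explicitly, here reusing the exact values reported in the log. This is a sketch: the constant and function names are hypothetical, and the `transformers` import is deferred so that nothing is downloaded until the builder is actually called.

```python
# Values copied from the "No model was supplied" log message.
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"


def build_sentiment_pipeline(model=SENTIMENT_MODEL, revision=SENTIMENT_REVISION):
    """Create a sentiment pipeline pinned to a specific model and revision."""
    # Imported lazily: weights are fetched only when this function runs.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model, revision=revision)
```

Calling `build_sentiment_pipeline()` should then behave like the cell above, but without the warning, and results stay reproducible even if the library's default model changes in a later release.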


In [5]:
# Named entity recognition pipeline
ner_pipeline = pipeline("ner")
result_ner = ner_pipeline("My name is Taylor and I live in Beijing.")
# print(f"NER result: {result_ner}")
# Print each recognized entity
for entity in result_ner:
    print(f"Entity: {entity['word']}, type: {entity['entity']}")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0

Entity: Taylor, type: I-PER
Entity: Beijing, type: I-LOC
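The NER pipeline reports one item per token, with tags such as `I-PER`. For downstream use it is often handier to group the words by entity type. The helper below is a hypothetical sketch that works on the output shape shown above; it also accepts the `entity_group` key that the pipeline emits when created with `aggregation_strategy="simple"`, an option that merges word pieces into whole entities.

```python
from collections import defaultdict


def group_entities(ner_results):
    """Group recognized words by entity type, e.g. {'PER': ['Taylor']}."""
    grouped = defaultdict(list)
    for item in ner_results:
        # Token-level results use "entity"; aggregated results use "entity_group".
        label = item.get("entity_group") or item["entity"]
        # Drop the B-/I- prefix of the tagging scheme: "I-PER" -> "PER".
        label = label.split("-", 1)[-1]
        grouped[label].append(item["word"])
    return dict(grouped)


# Sample shaped like the NER output above (only the keys used here).
sample = [
    {"entity": "I-PER", "word": "Taylor"},
    {"entity": "I-LOC", "word": "Beijing"},
]
grouped = group_entities(sample)
print(grouped)  # {'PER': ['Taylor'], 'LOC': ['Beijing']}
```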


In [7]:

# Question-answering pipeline
question_answerer = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
result_qa = question_answerer(question="What's my name?", context="My name is Taylor and I live in Beijing.")
print(f"Question-answering result: {result_qa}")
print(f"Answer: {result_qa['answer']}")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0

Question-answering result: {'score': 0.963897705078125, 'start': 11, 'end': 17, 'answer': 'Taylor'}
Answer: Taylor
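The `start` and `end` fields in the result are character offsets into the context, which makes it easy to locate or highlight the answer in the original text. The snippet below checks this against the values printed above; `highlight_answer` is a hypothetical helper added for illustration.

```python
context = "My name is Taylor and I live in Beijing."
# Result copied from the question-answering output above.
result_qa = {"score": 0.963897705078125, "start": 11, "end": 17, "answer": "Taylor"}

# Slicing the context with the offsets recovers the answer text.
span = context[result_qa["start"]:result_qa["end"]]
assert span == result_qa["answer"]


def highlight_answer(context, result, marker="**"):
    """Wrap the answer span in `marker` inside the context."""
    start, end = result["start"], result["end"]
    return context[:start] + marker + context[start:end] + marker + context[end:]


highlighted = highlight_answer(context, result_qa)
print(highlighted)  # My name is **Taylor** and I live in Beijing.
```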


In [10]:

# Summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text_to_summarize = """
My name is Taylor and I live in Beijing. I'm a freelance writer and photographer.
"""
result_summary = summarizer(text_to_summarize, max_length=30, min_length=5, do_sample=False)
print(f"Summarization result: {result_summary}")

Device set to use mps:0
Your max_length is set to 30, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)

Summarization result: [{'summary_text': 'Taylor is a freelance writer and photographer. He lives in Beijing.'}]
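The warning above appears because `max_length=30` exceeds the 24-token input. One way to avoid it is to derive the length limits from the input size. The helper below is a rough heuristic sketch: the function name, the 0.5 ratio, and the floor of 5 are all arbitrary choices made here, not values prescribed by the library.

```python
def pick_summary_lengths(n_input_tokens, ratio=0.5, min_floor=5):
    """Return (min_length, max_length) scaled to the input token count."""
    # Cap the summary at a fraction of the input, but keep it above the floor.
    max_length = max(min_floor + 1, int(n_input_tokens * ratio))
    # min_length must stay strictly below max_length.
    min_length = min(min_floor, max_length - 1)
    return min_length, max_length


# The token count could be obtained from the pipeline's tokenizer, e.g.
#   n_tokens = len(summarizer.tokenizer(text_to_summarize)["input_ids"])
min_len, max_len = pick_summary_lengths(24)
print(min_len, max_len)  # 5 12
```

For the 24-token input this yields `max_length=12`, the same value the warning message suggests; the two numbers would then be passed as `summarizer(text_to_summarize, min_length=min_len, max_length=max_len, do_sample=False)`.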


In [13]:

# Several pipelines can be combined
def complex_pipeline(text):
    sentiment_result = sentiment_pipeline(text)[0]
    ner_result = ner_pipeline(text)
    return {"sentiment": sentiment_result, "entities": ner_result}

complex_result = complex_pipeline("This amazing Apple product is fantastic!")
print(f"Combined pipeline result: {complex_result}")
print(f"Sentiment analysis result: {complex_result['sentiment']}")
print(f"NER result: {complex_result['entities']}")

for entity in complex_result['entities']:
    print(f"Entity: {entity['word']}, type: {entity['entity']}")

Combined pipeline result: {'sentiment': {'label': 'POSITIVE', 'score': 0.9998869895935059}, 'entities': [{'entity': 'I-ORG', 'score': 0.98976, 'index': 3, 'word': 'Apple', 'start': 13, 'end': 18}]}
Sentiment analysis result: {'label': 'POSITIVE', 'score': 0.9998869895935059}
NER result: [{'entity': 'I-ORG', 'score': 0.98976, 'index': 3, 'word': 'Apple', 'start': 13, 'end': 18}]
Entity: Apple, type: I-ORG
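The combined result carries more detail than many applications need. A small post-processing step can reduce it to a compact record and drop low-confidence entities. This is a sketch: `compact_report` and the 0.9 threshold are hypothetical choices, and the sample dict is copied from the output printed above.

```python
def compact_report(result, min_score=0.9):
    """Reduce a combined sentiment + NER result to a compact summary."""
    entities = [
        # Strip the B-/I- tag prefix: "I-ORG" -> "ORG".
        {"type": item["entity"].split("-", 1)[-1], "value": item["word"]}
        for item in result["entities"]
        if item["score"] >= min_score  # keep only confident predictions
    ]
    return {"sentiment": result["sentiment"]["label"].lower(), "entities": entities}


# Combined result copied from the output above.
complex_result = {
    "sentiment": {"label": "POSITIVE", "score": 0.9998869895935059},
    "entities": [
        {"entity": "I-ORG", "score": 0.98976, "index": 3,
         "word": "Apple", "start": 13, "end": 18}
    ],
}
report = compact_report(complex_result)
print(report)  # {'sentiment': 'positive', 'entities': [{'type': 'ORG', 'value': 'Apple'}]}
```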


## Other applications

- [OutfitAnyone - a Hugging Face Space by HumanAIGC](https://huggingface.co/spaces/HumanAIGC/OutfitAnyone)