In [1]:
from transformers import pipeline


## Text classification

In [2]:

# Text classification (sentiment analysis)
classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(result)

result = classifier("I love you")[0]
print(result)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'label': 'NEGATIVE', 'score': 0.9991129040718079}
{'label': 'POSITIVE', 'score': 0.9998656511306763}
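
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so is given here, with the model name and revision copied from that warning. `PINNED_SENTIMENT` and `build_pinned_classifier` are hypothetical names introduced for illustration, and the sketch assumes the Hub resolves the abbreviated revision hash exactly as the default pipeline did.

```python
# Model name and revision copied from the warning printed by the default pipeline.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",  # assumption: the short hash resolves on the Hub
}


def build_pinned_classifier(spec=PINNED_SENTIMENT):
    """Create the pipeline on demand; the first call downloads the weights."""
    # Imported lazily so the spec above can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(spec["task"], model=spec["model"], revision=spec["revision"])
```

Calling `build_pinned_classifier()("I love you")[0]` should then behave like the default pipeline above, without the "No model was supplied" warning.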


## Reading comprehension (question answering)

In [3]:

# Reading comprehension: extract the answer span from the context
question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a 
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune 
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?",
                           context=context)
print(result)

result = question_answerer(
    question="What is a good example of a question answering dataset?",
    context=context)

print(result)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6177283525466919, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5152308940887451, 'start': 148, 'end': 161, 'answer': 'SQuAD dataset'}
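
The `start` and `end` fields in the output are character offsets into `context`, so the answer can be recovered by slicing. The sketch below checks this against the first result above. It uses only the first sentence of the context (including the leading newline of the triple-quoted string), which is enough to cover that span.

```python
# First sentence of the context above; the raw string starts with a newline.
context = (
    "\n"
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# First result copied from the output above.
result = {
    "score": 0.6177283525466919,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}

# Slice the context with the reported character offsets.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```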


## Cloze test (fill-mask)

In [4]:
# Cloze test: predict the masked token
unmasker = pipeline("fill-mask")

sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'

unmasker(sentence)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.17927497625350952,
  'token': 3944,
  'token_str': ' tool',
  'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.'},
 {'score': 0.11349403858184814,
  'token': 7208,
  'token_str': ' framework',
  'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.'},
 {'score': 0.05243556201457977,
  'token': 5560,
  'token_str': ' library',
  'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.'},
 {'score': 0.03493537753820419,
  'token': 8503,
  'token_str': ' database',
  'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.'},
 {'score': 0.02860264666378498,
  'token': 17715,
  'token_str': ' prototype',
  'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.'}]
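
The raw fill-mask output is verbose. As a small sketch, the candidates can be reduced to `(word, score)` pairs above a threshold. `top_candidates` and the `min_score` cutoff are hypothetical choices made for illustration; the data is copied from the output above with the `sequence` field omitted.

```python
# Predictions copied from the output above (the 'sequence' field is omitted).
predictions = [
    {"score": 0.17927497625350952, "token": 3944, "token_str": " tool"},
    {"score": 0.11349403858184814, "token": 7208, "token_str": " framework"},
    {"score": 0.05243556201457977, "token": 5560, "token_str": " library"},
    {"score": 0.03493537753820419, "token": 8503, "token_str": " database"},
    {"score": 0.02860264666378498, "token": 17715, "token_str": " prototype"},
]


def top_candidates(predictions, min_score=0.05):
    """Return (word, rounded score) pairs whose score is at least min_score."""
    return [
        (p["token_str"].strip(), round(p["score"], 3))
        for p in predictions
        if p["score"] >= min_score
    ]


candidates = top_candidates(predictions)
print(candidates)
```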

## Text generation

In [5]:
# Text generation
text_generator = pipeline("text-generation")

text_generator("As far as I am concerned, I will",
               max_length=50,
               do_sample=False)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

## Named entity recognition

In [6]:
ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""


for entity in ner_pipe(sequence):
    print(entity)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.95142686, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.93365884, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9761654, 'index': 28, 'word': 'Manhattan', 'sta
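
The output above is per word piece (`Hu`, `##gging`, ...), not per entity. The token classification pipeline accepts an `aggregation_strategy` argument (for example `"simple"`) that groups pieces for you. The sketch below is a hand-rolled illustration of the same idea, not the library's implementation: it merges tokens with consecutive `index` values and the same label, then slices the original text with the merged offsets. `merge_word_pieces` is a hypothetical helper, and taking the minimum score per group is an arbitrary choice made here.

```python
# Text and complete token lines copied from the run above.
TEXT = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,"
TOKENS = [
    {"entity": "I-ORG", "score": 0.99957865, "index": 1, "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.9909764, "index": 2, "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.9982224, "index": 3, "word": "Face", "start": 8, "end": 12},
    {"entity": "I-ORG", "score": 0.9994879, "index": 4, "word": "Inc", "start": 13, "end": 16},
    {"entity": "I-LOC", "score": 0.9994344, "index": 11, "word": "New", "start": 40, "end": 43},
    {"entity": "I-LOC", "score": 0.99931955, "index": 12, "word": "York", "start": 44, "end": 48},
    {"entity": "I-LOC", "score": 0.9993794, "index": 13, "word": "City", "start": 49, "end": 53},
    {"entity": "I-LOC", "score": 0.98625815, "index": 19, "word": "D", "start": 79, "end": 80},
    {"entity": "I-LOC", "score": 0.95142686, "index": 20, "word": "##UM", "start": 80, "end": 82},
    {"entity": "I-LOC", "score": 0.93365884, "index": 21, "word": "##BO", "start": 82, "end": 84},
]


def merge_word_pieces(tokens, text):
    """Merge adjacent tokens that share a label into one entity span."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        if last and last["label"] == tok["entity"] and tok["index"] == last["index"] + 1:
            # Extend the current group to cover this token.
            last["end"] = tok["end"]
            last["index"] = tok["index"]
            last["score"] = min(last["score"], tok["score"])
        else:
            groups.append({
                "label": tok["entity"],
                "index": tok["index"],
                "start": tok["start"],
                "end": tok["end"],
                "score": tok["score"],
            })
    return [
        {
            "entity_group": g["label"].split("-", 1)[-1],  # drop the "I-" prefix
            "word": text[g["start"]:g["end"]],
            "score": g["score"],
        }
        for g in groups
    ]


entities = merge_word_pieces(TOKENS, TEXT)
for entity in entities:
    print(entity)
```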

## Text summarization

In [7]:

# Text summarization
summarizer = pipeline("summarization")


ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""


summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]

## Translation (English to Chinese)

In [10]:
from transformers import AutoModelWithLMHead, AutoTokenizer

# Reference: https://huggingface.co/liam168/trans-opus-mt-en-zh/blob/main/README.md?code=true#L28

# Load the English-to-Chinese model and tokenizer
model_name = 'liam168/trans-opus-mt-en-zh'
model = AutoModelWithLMHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [13]:

translation = pipeline("translation_en_to_zh", 
                       model=model, 
                       tokenizer=tokenizer)

In [14]:

# Translate a short sentence
translation('I like to study Data Science and Machine Learning.', max_length=400)

[{'translation_text': '我喜欢学习数据科学和机器学习。'}]

In [16]:
# Translate the full article (long input)
translation(ARTICLE, 
            max_length=500)

Your input_length: 477 is bigger than 0.9 * max_length: 500. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


[{'translation_text': '在纽约(CNN),当Liana Barrientos 23岁时,她再次结了婚。一年后,她又在Westchester郡结了婚。一年后,她又在Westchester郡结了婚,在Westchester郡结了婚,但是她又嫁给了一个不同的男人。结婚只有18天后,她又结了婚。结婚只有18天。之后,Barrientos又宣布了五次“我做”,有时只在两周之内。2010年,她再次结婚一次,在布朗克斯。在申请结婚证时,她说这是她的“第一次和唯一的”婚姻。Barrientos,现在的39岁,面临两项刑事罪状,“在Westchester郡立了一个假的结婚工具,要提出头一级申请。”根据法院文件,2010年结婚许可证申请只有18天,检察官说婚姻是移民骗局的一部分。星期五,在布朗克斯州最高法院的男法官说,在离开法院之后,克里斯托弗·赖特(Christ Wright),她拒绝进一步发表意见。在离开法庭后,Barrientos被捕后,被偷窃服务和刑事。在据称进入纽约地铁的女律师共要进入了10年的婚姻。在离地铁共10年,在离婚期间,在离地铁共进行。'}]
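
The warning above shows the article is close to the length limit, and the translation degrades into repetition toward the end. One common workaround is to split the input into smaller chunks at sentence boundaries and translate each chunk separately. The sketch below assumes a character budget as a rough stand-in for the token limit. `split_into_chunks`, `translate_long_text`, and the `max_chars` value are hypothetical names and settings, and `translate` is any callable that returns output shaped like the pipeline's.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group whole sentences into chunks of at most roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, translate, max_chars=400, joiner=""):
    """Translate chunk by chunk; `translate` returns [{'translation_text': ...}]."""
    pieces = [
        translate(chunk)[0]["translation_text"]
        for chunk in split_into_chunks(text, max_chars)
    ]
    return joiner.join(pieces)


# Offline demonstration with a stand-in translator that upper-cases its input.
fake_translate = lambda chunk: [{"translation_text": chunk.upper()}]
demo = translate_long_text("A b. C d. E f.", fake_translate, max_chars=9)
print(demo)
```

With the real pipeline, the call would look like `translate_long_text(ARTICLE, translation)`. A single sentence longer than `max_chars` is still passed through as one chunk, so the budget is only approximate.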

## Chinese-to-English translation

Link: https://huggingface.co/liam168/trans-opus-mt-zh-en

In [21]:
model_name = 'liam168/trans-opus-mt-zh-en'

# Load the Chinese-to-English model and tokenizer
zh_to_en_model = AutoModelWithLMHead.from_pretrained(model_name)
zh_to_en_tokenizer = AutoTokenizer.from_pretrained(model_name)

In [22]:
translation_en = pipeline("translation_zh_to_en", 
                          model=zh_to_en_model, 
                          tokenizer=zh_to_en_tokenizer)

In [23]:
translation_en('我喜欢学习数据科学和机器学习。', max_length=400)

[{'translation_text': 'I like to study data science and machine learning.'}]

In [24]:
translation_en('听说你刷房子。', max_length=400)

[{'translation_text': 'I heard you brushed the house.'}]