# LLM Course

## Hugging Face pipeline

`pip install datasets evaluate transformers[sentencepiece]`

```
Some of the pipelines currently available:

feature-extraction (get a vector representation of text)
fill-mask (fill in masked words)
ner (named entity recognition)
question-answering
sentiment-analysis
summarization
text-generation
translation
zero-shot-classification
```

### Basic tasks a pipeline can handle

In [8]:
from transformers import pipeline

In [9]:
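### Sentiment classification

classifier = pipeline('sentiment-analysis')
# Inputs: "I don't particularly like math" and "He seems like a good person".
# The default checkpoint is English-only, so its scores on Chinese text are unreliable.
classifier(["我不是特别喜欢数学","他看起来好像是个好人的样子"])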

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.8201040625572205},
 {'label': 'NEGATIVE', 'score': 0.9013534188270569}]
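
The log above warns against relying on the default checkpoint. A minimal sketch of pinning it explicitly, using the model name and revision printed in that log. The helper names `build_sentiment_classifier` and `confident_labels`, and the 0.9 cut-off, are assumptions made for illustration and are not part of the original notes.

```python
# Model name and revision copied from the log message printed by the cell above.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_sentiment_classifier():
    """Create the pipeline with an explicit checkpoint (downloads weights on first use)."""
    # Imported lazily so the helper below can be tried without transformers installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=REVISION)


def confident_labels(results, threshold=0.9):
    """Keep a label only when its score reaches the threshold (an assumed cut-off)."""
    return [r["label"] if r["score"] >= threshold else None for r in results]


# Output recorded in the cell above, reused here so the example runs offline.
recorded = [
    {"label": "NEGATIVE", "score": 0.8201040625572205},
    {"label": "NEGATIVE", "score": 0.9013534188270569},
]
print(confident_labels(recorded))  # [None, 'NEGATIVE']
```

Filtering on the score is one possible way to avoid over-trusting predictions like the first one, where an English-only model was given Chinese text.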

In [10]:
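# Zero-shot classification

classifier = pipeline("zero-shot-classification")
# Input: "This is a course about the transformers library, aimed at all computer science students in this major"
# Candidate labels: education, politics, business
classifier(
    "这是一门关于transformers库的课程，面向本专业所有计算机学生",
    candidate_labels=["教育","政治","商业"]
)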

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cpu


{'sequence': '这是一门关于transformers库的课程，面向本专业所有计算机学生',
 'labels': ['教育'],
 'scores': [0.08812970668077469]}
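
The result dictionary pairs each entry in `labels` with the entry at the same position in `scores`. A small sketch of turning that into a ranked list and picking a label only when it is convincing. `rank_labels`, `best_label`, and the 0.5 cut-off are hypothetical names and values introduced here.

```python
def rank_labels(result):
    """Pair each label with its score and sort from most to least likely."""
    pairs = zip(result["labels"], result["scores"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def best_label(result, min_score=0.5):
    """Return the top label, or None when even the best score is below min_score."""
    ranked = rank_labels(result)
    if not ranked or ranked[0][1] < min_score:
        return None
    return ranked[0][0]


# Output recorded in the cell above, reused here so the example runs offline.
recorded = {
    "sequence": "这是一门关于transformers库的课程，面向本专业所有计算机学生",
    "labels": ["教育"],
    "scores": [0.08812970668077469],
}
print(rank_labels(recorded))
print(best_label(recorded))  # None: 0.088 is below the assumed 0.5 cut-off
```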

In [16]:
# Text generation

generator = pipeline("text-generation")
generator("In this course, we will teach you how to",num_return_sequences=2,max_length=15)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to easily create a custom project'},
 {'generated_text': 'In this course, we will teach you how to create and use virtual machines'}]
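
Each `generated_text` repeats the prompt before the new tokens. A minimal sketch of separating the continuation from the prompt, plus a seeded variant of the call for repeatable sampling. `generate_seeded`, `continuations`, the seed 42, and `max_new_tokens=10` are assumptions chosen for illustration.

```python
def generate_seeded(prompt, seed=42):
    """Sample two continuations with a fixed seed (downloads weights on first use)."""
    # Imported lazily so the helper below can be tried without transformers installed.
    from transformers import pipeline, set_seed

    set_seed(seed)
    generator = pipeline("text-generation", model="openai-community/gpt2")
    # max_new_tokens counts only generated tokens, unlike max_length.
    return generator(prompt, num_return_sequences=2, max_new_tokens=10, do_sample=True)


def continuations(prompt, outputs):
    """Return only the newly generated text for each output."""
    texts = []
    for item in outputs:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


# Output recorded in the cell above, reused here so the example runs offline.
prompt = "In this course, we will teach you how to"
recorded = [
    {"generated_text": "In this course, we will teach you how to easily create a custom project"},
    {"generated_text": "In this course, we will teach you how to create and use virtual machines"},
]
print(continuations(prompt, recorded))
# ['easily create a custom project', 'create and use virtual machines']
```

The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing.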

In [18]:
# Load a different model from the Hugging Face Hub

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=30,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the language in Windows. If you have any questions, feel free to drop any suggestions on the'},
 {'generated_text': 'In this course, we will teach you how to install Android on the Nexus 5/5 Nexus 5 by using the code below:\n\n\n\n'}]

In [22]:
# Fill-mask (cloze test)

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.",top_k=2)  # top_k controls how many predictions are returned

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.1919846385717392,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209209978580475,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
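
The mask token depends on the checkpoint: the RoBERTa-style model used above expects `<mask>`, while BERT-style models expect `[MASK]`. A loaded pipeline exposes the right one as `unmasker.tokenizer.mask_token`. The sketch below builds the prompt from a placeholder and collects the predicted words. `build_prompt` and `candidate_words` are hypothetical helper names.

```python
def build_prompt(template, mask_token):
    """Fill the {mask} placeholder with the checkpoint's own mask token."""
    return template.format(mask=mask_token)


def candidate_words(predictions):
    """Collect the predicted words, dropping the leading space the tokenizer keeps."""
    return [p["token_str"].strip() for p in predictions]


# Output recorded in the cell above (trimmed to the fields used here).
recorded = [
    {"score": 0.1919846385717392, "token": 30412, "token_str": " mathematical"},
    {"score": 0.04209209978580475, "token": 38163, "token_str": " computational"},
]
template = "This course will teach you all about {mask} models."
print(build_prompt(template, "<mask>"))
print(candidate_words(recorded))  # ['mathematical', 'computational']
```

With a real pipeline the call would look like `unmasker(build_prompt(template, unmasker.tokenizer.mask_token), top_k=2)`.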

In [36]:
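# Named entity recognition

ner = pipeline("ner",grouped_entities=True)
ner("my name is Tangrui, and I am studying Computer Science in NJU")
# grouped_entities=True tells the pipeline to merge the tokens that belong to the same entity, so a name made of several tokens comes back as a single entry.
# In this run the model returned "Tangrui" (PER) and "NJU" (ORG); "Computer Science" was not tagged as an entity.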

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.98813444,
  'word': 'Tangrui',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9869808,
  'word': 'NJU',
  'start': 58,
  'end': 61}]
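
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A minimal sketch that groups the sliced text by entity type. `build_ner` and `entities_by_type` are hypothetical names, and the use of `aggregation_strategy="simple"` in place of `grouped_entities=True` is an assumption about newer transformers releases.

```python
def build_ner():
    """Create the NER pipeline with entity grouping (downloads weights on first use)."""
    # Imported lazily so the helper below can be tried without transformers installed.
    from transformers import pipeline

    # Assumed to be the newer spelling of grouped_entities=True.
    return pipeline("ner", aggregation_strategy="simple")


def entities_by_type(text, entities):
    """Group entity strings by type, slicing them out of the text via start/end."""
    grouped = {}
    for ent in entities:
        surface = text[ent["start"]:ent["end"]]
        grouped.setdefault(ent["entity_group"], []).append(surface)
    return grouped


# Input and output recorded in the cell above, reused so the example runs offline.
text = "my name is Tangrui, and I am studying Computer Science in NJU"
recorded = [
    {"entity_group": "PER", "score": 0.98813444, "word": "Tangrui", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9869808, "word": "NJU", "start": 58, "end": 61},
]
print(entities_by_type(text, recorded))  # {'PER': ['Tangrui'], 'ORG': ['NJU']}
```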

In [28]:
# Question answering: the answer is extracted from the given context only

answer = pipeline("question-answering")
answer(
    question="how does lily feel about homework",
    context="lily does not like doing homework",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.38847437500953674,
 'start': 5,
 'end': 33,
 'answer': 'does not like doing homework'}
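
Because the model extracts a span rather than generating text, `start` and `end` index directly into the context. The sketch below checks that relationship and only accepts an answer above a score cut-off. `accept_answer` and the 0.5 default are assumptions made for illustration.

```python
def accept_answer(result, context, min_score=0.5):
    """Return the answer span if its score is high enough, else None."""
    span = context[result["start"]:result["end"]]
    # The offsets index into the context, so the slice should equal the answer.
    if span != result["answer"]:
        raise ValueError("start/end do not match the given context")
    return span if result["score"] >= min_score else None


# Context and output recorded in the cell above, reused so the example runs offline.
context = "lily does not like doing homework"
recorded = {
    "score": 0.38847437500953674,
    "start": 5,
    "end": 33,
    "answer": "does not like doing homework",
}
print(accept_answer(recorded, context))                 # None
print(accept_answer(recorded, context, min_score=0.3))  # does not like doing homework
```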

In [40]:
# Summarization
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
    """,
    max_length=30,
    min_length=20,
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as'}]
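
`max_length` and `min_length` are measured in tokens, and the summary above stops mid-sentence at "such as" because it hit the 30-token limit. A rough sketch of detecting that and retrying with a larger budget. `looks_truncated`, `summarize`, and the value 60 are hypothetical, and the punctuation check is only a heuristic.

```python
def looks_truncated(summary):
    """Heuristic: a summary that does not end in '.', '!' or '?' was probably cut off."""
    return not summary.rstrip().endswith((".", "!", "?"))


def summarize(text, max_length=60, min_length=20):
    """Summarize with a larger token budget (downloads weights on first use)."""
    # Imported lazily so the check above can be tried without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    return summarizer(text, max_length=max_length, min_length=min_length)[0]["summary_text"]


# Output recorded in the cell above, reused here so the example runs offline.
recorded = (
    " America has changed dramatically during recent years . The number of "
    "engineering graduates in the U.S. has declined in traditional engineering "
    "disciplines such as"
)
print(looks_truncated(recorded))  # True
```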

In [None]:
# Translation

# French to English translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
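
The translation pipeline returns one dictionary per input, with the text under the `translation_text` key. A minimal sketch of wrapping that for several sentences. `opus_mt_model`, `build_translator`, `translate_all`, and `fake_translator` are hypothetical names; the `Helsinki-NLP/opus-mt-{src}-{tgt}` pattern is inferred from the checkpoint used above, and not every language pair necessarily has a checkpoint.

```python
def opus_mt_model(src, tgt):
    """Build a checkpoint name following the pattern of the fr-en model above."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def build_translator(src, tgt):
    """Create a translation pipeline for one language pair (downloads weights)."""
    # Imported lazily so the helpers can be tried without transformers installed.
    from transformers import pipeline

    return pipeline("translation", model=opus_mt_model(src, tgt))


def translate_all(translator, sentences):
    """Run any pipeline-like callable and collect the translated strings."""
    return [out["translation_text"] for out in translator(sentences)]


def fake_translator(sentences):
    """Stand-in for a real pipeline so the example runs offline; it only upper-cases."""
    return [{"translation_text": s.upper()} for s in sentences]


print(opus_mt_model("fr", "en"))                    # Helsinki-NLP/opus-mt-fr-en
print(translate_all(fake_translator, ["bonjour"]))  # ['BONJOUR']
```

With the real model, `translate_all(build_translator("fr", "en"), ["Ce cours est produit par Hugging Face."])` would return the English sentences as plain strings.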