# NLP

## Text classification

In [2]:
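import os

# Set the cache locations before importing transformers: the Hugging Face
# libraries read these environment variables at import time.
os.environ['HF_HOME'] = '/mnt/d/hf'
os.environ['HF_HUB_CACHE'] = '/mnt/d/hf/hub'

from transformers import pipeline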

In [3]:
pipe1 = pipeline(task="sentiment-analysis", model='SamLowe/roberta-base-go_emotions')

In [4]:
text_list = [
    "Today Shanghai is really cold.",
    "I think the taste of the garlic mashed pork in this store is average.",
    "You learn things really quickly. You understand the theory class as soon as it is taught.",
    "I am not having a great day"
]
pipe1(text_list)

[{'label': 'neutral', 'score': 0.8033106923103333},
 {'label': 'disapproval', 'score': 0.5022488236427307},
 {'label': 'approval', 'score': 0.44847190380096436},
 {'label': 'disappointment', 'score': 0.46669524908065796}]
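
The scores above vary widely, and the last two fall below 0.5. As a minimal sketch, the predictions can be paired with their inputs and flagged when the score is low. The cutoff of 0.5 is an arbitrary assumption, not something the model prescribes, and the helper name `flag_low_confidence` is hypothetical. The predictions are copied from the output above so the snippet runs without loading the model.

```python
# Inputs and predictions copied from the cells above.
text_list = [
    "Today Shanghai is really cold.",
    "I think the taste of the garlic mashed pork in this store is average.",
    "You learn things really quickly. You understand the theory class as soon as it is taught.",
    "I am not having a great day",
]
preds = [
    {"label": "neutral", "score": 0.8033106923103333},
    {"label": "disapproval", "score": 0.5022488236427307},
    {"label": "approval", "score": 0.44847190380096436},
    {"label": "disappointment", "score": 0.46669524908065796},
]


def flag_low_confidence(texts, preds, threshold=0.5):
    """Pair each text with its top label and mark scores below the threshold.

    `threshold` is an assumed cutoff chosen for illustration only.
    """
    rows = []
    for text, pred in zip(texts, preds):
        rows.append({
            "text": text,
            "label": pred["label"],
            "score": round(pred["score"], 4),
            "confident": pred["score"] >= threshold,
        })
    return rows


rows = flag_low_confidence(text_list, preds)
for row in rows:
    print(row)
```

A stricter or looser cutoff may suit a given application better; the point is only that the top label alone hides how uncertain the model is.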

## Token classification

In [5]:
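# `grouped_entities=True` is deprecated; `aggregation_strategy="simple"` is the equivalent setting.
classifier = pipeline(task="ner", model="dslim/bert-base-NER", aggregation_strategy="simple")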

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [7]:
preds = classifier("Hugging Face is a French company based in New York City.")
preds

[{'entity_group': 'ORG',
  'score': 0.9287412,
  'word': 'Hugging Face',
  'start': 0,
  'end': 12},
 {'entity_group': 'MISC',
  'score': 0.9996295,
  'word': 'French',
  'start': 18,
  'end': 24},
 {'entity_group': 'LOC',
  'score': 0.9994915,
  'word': 'New York City',
  'start': 42,
  'end': 55}]
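
The `start` and `end` fields in the output above are character offsets into the input string. As a rough sketch, they can be used to slice the original text and collect entities by type. The helper name `entities_by_type` is hypothetical, and the predictions are copied from the output above so the snippet runs without loading the model.

```python
from collections import defaultdict

text = "Hugging Face is a French company based in New York City."

# Predictions copied from the output above.
preds = [
    {"entity_group": "ORG", "score": 0.9287412, "word": "Hugging Face", "start": 0, "end": 12},
    {"entity_group": "MISC", "score": 0.9996295, "word": "French", "start": 18, "end": 24},
    {"entity_group": "LOC", "score": 0.9994915, "word": "New York City", "start": 42, "end": 55},
]


def entities_by_type(text, preds):
    """Group entity spans by their `entity_group`, slicing the spans from `text`."""
    grouped = defaultdict(list)
    for pred in preds:
        # Slice the original text instead of trusting `word`, which comes
        # from the tokenizer and may differ in casing or spacing.
        span = text[pred["start"]:pred["end"]]
        grouped[pred["entity_group"]].append(span)
    return dict(grouped)


grouped = entities_by_type(text, preds)
print(grouped)
```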

## Question Answering

In [8]:
qa = pipeline(task="question-answering", model="deepset/roberta-base-squad2")

In [9]:
preds = qa(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

score: 0.9068, start: 30, end: 54, answer: huggingface/transformers


In [10]:
preds = qa(
    question="What is the capital of China?",
    context="On 1 October 1949, CCP Chairman Mao Zedong formally proclaimed the People's Republic of China in Tiananmen Square, Beijing.",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

score: 0.7504, start: 115, end: 122, answer: Beijing
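
In both results, `start` and `end` are character offsets into `context`, so the answer can be located in the passage directly. The sketch below marks the answer span with brackets; the helper name `mark_answer` and the bracket markers are assumptions for illustration, and the prediction is copied from the first output above.

```python
def mark_answer(context, pred, open_mark="[", close_mark="]"):
    """Return `context` with the predicted answer span wrapped in markers."""
    start, end = pred["start"], pred["end"]
    return context[:start] + open_mark + context[start:end] + close_mark + context[end:]


context = "The name of the repository is huggingface/transformers"
# Prediction copied from the first question-answering output above.
pred = {"score": 0.9068, "start": 30, "end": 54, "answer": "huggingface/transformers"}

marked = mark_answer(context, pred)
print(marked)
```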


## Summarization

In [11]:
from transformers import pipeline

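# Note: this checkpoint is fine-tuned for Italian news summarization (MLSum-it),
# so English input may produce poor or Italian-language summaries.
summarizer = pipeline(task="summarization",
                      model="ARTeLab/it5-summarization-mlsum",
                      min_length=8,
                      max_length=32
)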

In [12]:
summarizer(
    """
    In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, 
    replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. 
    For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. 
    On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. 
    In the former task our best model outperforms even all previously reported ensembles.
    """
)


[{'summary_text': 'our best model outperforms even all previously reported ensembles.'}]

In [13]:
summarizer(
    '''
    Large language models (LLM) are very large deep learning models that are pre-trained on vast amounts of data. 
    The underlying transformer is a set of neural networks that consist of an encoder and a decoder with self-attention capabilities. 
    The encoder and decoder extract meanings from a sequence of text and understand the relationships between words and phrases in it.
    Transformer LLMs are capable of unsupervised training, although a more precise explanation is that transformers perform self-learning. 
    It is through this process that transformers learn to understand basic grammar, languages, and knowledge.
    Unlike earlier recurrent neural networks (RNN) that sequentially process inputs, transformers process entire sequences in parallel. 
    This allows the data scientists to use GPUs for training transformer-based LLMs, significantly reducing the training time.
    '''
)


[{'summary_text': 'gpu e gpu consentono di testare le llm e le llm'}]

# Audio

## Audio classification

In [14]:
classifier = pipeline(task="audio-classification", model="superb/wav2vec2-base-superb-sid")

Some weights of the model checkpoint at superb/wav2vec2-base-superb-sid were not used when initializing Wav2Vec2ForSequenceClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at superb/wav2vec2-base-superb-sid and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_

In [15]:
preds = classifier("../LLM-quickstart/transformers/data/audio/mlk.flac")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

[{'score': 0.9856, 'label': 'id10151'},
 {'score': 0.0085, 'label': 'id11068'},
 {'score': 0.003, 'label': 'id10979'},
 {'score': 0.0014, 'label': 'id10522'},
 {'score': 0.0007, 'label': 'id10443'}]

## Speech recognition

In [16]:
from transformers import pipeline

# Use the `model` parameter to specify the model
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [18]:
text = transcriber("../LLM-quickstart/transformers/data/audio/mlk.flac")
text

{'text': ' I have a dream. Good one day. This nation will rise up. Live out the true meaning of its dream.'}

# Vision

## Image classification

In [19]:
from transformers import pipeline

classifier = pipeline(task="image-classification", model="microsoft/resnet-50")

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


In [20]:
preds = classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")

{'score': 0.5874, 'label': 'lynx, catamount'}
{'score': 0.1289, 'label': 'tabby, tabby cat'}
{'score': 0.075, 'label': 'marmot'}
{'score': 0.0382, 'label': 'badger'}
{'score': 0.0131, 'label': 'Egyptian cat'}


In [22]:
preds = classifier(
    "../LLM-quickstart/transformers/data/image/panda.jpg"
)
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
print(*preds, sep="\n")

{'score': 0.9768, 'label': 'giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca'}
{'score': 0.0088, 'label': 'indri, indris, Indri indri, Indri brevicaudatus'}
{'score': 0.0004, 'label': 'groenendael'}
{'score': 0.0003, 'label': 'Siberian husky'}
{'score': 0.0002, 'label': 'malamute, malemute, Alaskan malamute'}
