In [35]:
from transformers import pipeline, set_seed
import pandas as pd
import soundfile as sf

In [26]:
set_seed(0)

In [5]:
classifier = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

text = "I am happy"
output = classifier(text)
pd.DataFrame(output)






      label    score
0  POSITIVE  0.99988


# Named Entity Recognition

## Token Classification Model

A token is a unit of text, such as a word, a subword piece, or a character. In NER, we want to identify which tokens belong to named entities, such as people and organizations.


In [9]:
ner_tagger = pipeline(task="ner", model="FacebookAI/xlm-roberta-large-finetuned-conll03-english")   # "ner" is an alias for "token-classification"
text_token = "My name is Darius. I work for Bixag Romania."

output_token = ner_tagger(text_token)
pd.DataFrame(output_token)

Some weights of the model checkpoint at FacebookAI/xlm-roberta-large-finetuned-conll03-english were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


   entity     score  index      word  start  end
0   I-PER  0.999936      4     ▁Dari     11   15
1   I-PER  0.999952      5        us     15   17
2   I-ORG   0.99999     10       ▁Bi     30   32
3   I-ORG  0.999985     11        xa     32   34
4   I-ORG   0.99999     12         g     34   35
5   I-ORG  0.999988     13  ▁Romania     36   43
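
The rows above are subword pieces rather than whole entities: `▁Dari` and `us` together spell one name. Below is a minimal sketch of merging adjacent pieces that share a label, using the character offsets from the table. `rows` and `merge_tokens` are hypothetical names introduced for illustration, and the "gap of at most one character" rule is an assumption that suits this sentence, not a general algorithm.

```python
text_token = "My name is Darius. I work for Bixag Romania."

# Labels and character offsets copied from the NER output table.
rows = [
    {"entity": "I-PER", "start": 11, "end": 15},
    {"entity": "I-PER", "start": 15, "end": 17},
    {"entity": "I-ORG", "start": 30, "end": 32},
    {"entity": "I-ORG", "start": 32, "end": 34},
    {"entity": "I-ORG", "start": 34, "end": 35},
    {"entity": "I-ORG", "start": 36, "end": 43},
]


def merge_tokens(rows, text):
    """Merge neighbouring pieces with the same label into one span."""
    spans = []
    for row in rows:
        label = row["entity"].split("-")[-1]  # "I-PER" -> "PER"
        same_label = bool(spans) and spans[-1]["label"] == label
        # Assumption: pieces at most one character apart belong together.
        if same_label and row["start"] - spans[-1]["end"] <= 1:
            spans[-1]["end"] = row["end"]
        else:
            spans.append({"label": label, "start": row["start"], "end": row["end"]})
    # Slice the original text so the "▁" word-start markers never appear.
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


entities = merge_tokens(rows, text_token)
print(entities)
```

The token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping inside the library.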


# Question Answering

In [10]:
text = """
Dear Amazon, last week I ordered an Optimus Prime action figure from your
online store in India. Unfortunately when I opened the package, I discovered to
my horror that I had been sent an action figure of Megatron instead!
"""

In [17]:
reader = pipeline(task="question-answering")

question_customer = "What was wrong?"
question_customer_1 = "What did you order?"
question_customer_2 = "What did you receive?"


output_question = reader(question=question_customer, context=text)      # reader uses the default model; pass model=... to pipeline() to choose another

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [18]:
pd.Series(output_question)

score                                               0.27679
start                                                   170
end                                                     222
answer    I had been sent an action figure of Megatron i...
dtype: object
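
The `start` and `end` values are character offsets into the context string, so the truncated `answer` shown above can be recovered by slicing. Here is a small check, assuming the same `text` as in the earlier cell and the offsets reported in the output:

```python
text = """
Dear Amazon, last week I ordered an Optimus Prime action figure from your
online store in India. Unfortunately when I opened the package, I discovered to
my horror that I had been sent an action figure of Megatron instead!
"""

# Offsets copied from the question-answering output.
start, end = 170, 222

# The answer is the slice of the context between the two offsets.
answer = text[start:end]
print(answer)
```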

# Summarization Pipeline

In [19]:
summary = pipeline(task="summarization")

output_summary = summary(text)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Your max_length is set to 142, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


In [20]:
output_summary

[{'summary_text': ' Amazon sent an Optimus Prime action figure from your online store in India . Unfortunately when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! Amazon.com has been sending a figure of Optimus Prime instead of Optimus Optimus Prime .'}]

In [24]:
output_summary[0]["summary_text"][len(text):]

'sending a figure of Optimus Prime instead of Optimus Optimus Prime .'

# Text Generation Pipeline

In [27]:
text

'\nDear Amazon, last week I ordered an Optimus Prime action figure from your\nonline store in India. Unfortunately when I opened the package, I discovered to\nmy horror that I had been sent an action figure of Megatron instead!\n'

In [29]:
generator = pipeline(task="text-generation")

response = "I am sorry to hear that your order wax mixed up"
prompt = f"User: {text}\nCustomer Service Response: {response}"

output_generator = generator(prompt, max_length=128)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [32]:
output_generator

[{'generated_text': "User: \nDear Amazon, last week I ordered an Optimus Prime action figure from your\nonline store in India. Unfortunately when I opened the package, I discovered to\nmy horror that I had been sent an action figure of Megatron instead!\n\nCustomer Service Response: I am sorry to hear that your order wax mixed up or that you were not ordered exactly as described. The\n\naction figure arrived in a beautiful box inside a\n\nbox that I'm hoping will hold up great condition, especially when it comes in a\n\ndeluxe box. But I can make it look bad on you or have something to do"}]
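
The generated text repeats the prompt before the model's continuation. Below is a minimal sketch of separating the two, using a shortened stand-in for the pipeline output. `split_continuation` is a hypothetical helper, and the sample strings are illustrative, not real model output.

```python
def split_continuation(generated_text, prompt):
    """Return only the text that follows the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt was altered (e.g. truncated).
    return generated_text


# Stand-in data with the same shape as the pipeline output (a list of dicts).
prompt = "User: my order was wrong\nCustomer Service Response: I am sorry"
fake_output = [{"generated_text": prompt + " to hear that."}]

continuation = split_continuation(fake_output[0]["generated_text"], prompt)
print(repr(continuation))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation without this post-processing step.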

# Translation Pipeline

In [33]:
translator = pipeline(task="translation_en_to_ro")

output_translator = translator(text)

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.





In [34]:
output_translator

[{'translation_text': 'Stimate Amazon, săptămâna trecută am comandat o figură de acţiune Optimus Prime de la magazinul dvs. online din India şi, din păcate, când am deschis pachetul, am aflat cu groază că mi s-a trimis în schimb o figură de acţiune a Megatron!'}]

# Text to Speech Pipeline

In [40]:
text

'\nDear Amazon, last week I ordered an Optimus Prime action figure from your\nonline store in India. Unfortunately when I opened the package, I discovered to\nmy horror that I had been sent an action figure of Megatron instead!\n'

In [37]:
synth = pipeline(task="text-to-speech")

No model was supplied, defaulted to suno/bark-small and revision 645cfba (https://huggingface.co/suno/bark-small).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [38]:
speech = synth(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [39]:
sf.write("speech.wav", speech["audio"].T, samplerate=speech["sampling_rate"])