### Transformers Pipelines

The pipeline allows us to perform common NLP tasks such as text classification, named entity recognition (NER), question answering, summarization, translation, and more, with just a few lines of code. It automatically handles steps such as tokenization, model loading, inference, and post-processing.

We can use pretrained models or plug in our own custom models.

#### Important arguments

Some important parameters of the Transformers pipeline:
- task: the NLP task, such as sentiment-analysis or text-generation
- model: name of the pretrained model to load
- tokenizer: tokenizer used to convert text into tokens
- feature_extractor: feature extractor used to prepare non-text inputs such as audio or images
- framework: "pt" (PyTorch) or "tf" (TensorFlow)
- device: where to run the model, CPU or GPU (CUDA)
- max_length: maximum sequence length in tokens; anything beyond it is truncated, which helps control memory usage during inference

#### How it works

The Transformers library's pipeline is a high-level abstraction that encapsulates the steps involved in using pretrained models for natural language processing (NLP) tasks.

#### Its components

- Task-Specific Model Loading: The pipeline selects and loads a pre-trained model that is specifically designed for the chosen NLP task. For example, if the task is sentiment analysis, the pipeline loads a pre-trained model that has been fine-tuned on sentiment analysis tasks.

- Tokenization: The input text is tokenized using the tokenizer associated with the loaded model. Tokenization involves breaking down the input text into smaller units such as words, subwords, or characters, depending on the tokenizer's configuration.

- Inference: The tokenized input is fed into the loaded model for inference. The model processes the input tokens through its layers, applying various transformations and computations to generate predictions or outputs specific to the chosen task.

- Post-Processing: The model's outputs are post-processed as necessary to obtain the final result. This may involve converting model outputs into human-readable formats, aggregating multiple outputs, or performing additional processing steps depending on the task requirements.

- Output: The final output of the pipeline is returned to the user. This output typically includes the predictions or results of the NLP task, such as sentiment labels, named entities, answers to questions, summaries of text, translations, etc.
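
The stages listed above can be sketched in plain Python. This is only a toy illustration, not how the Transformers library implements its pipeline: the vocabulary, the weights, and the function names (`tokenize`, `infer`, `post_process`, `toy_pipeline`) are all invented for this example, and a real model is a neural network rather than a table of per-word weights.

```python
import math

# Hypothetical toy vocabulary and weights, invented for illustration only.
VOCAB = {"[UNK]": 0, "good": 1, "bad": 2, "not": 3, "great": 4}
# One (negative, positive) weight pair per vocabulary entry.
WEIGHTS = [(0.0, 0.0), (-1.0, 2.0), (2.0, -1.0), (0.5, -0.5), (-1.5, 2.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Tokenization: lowercase, split on whitespace, map words to IDs."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def infer(token_ids):
    """Inference: sum per-token weights to get one logit per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        neg, pos = WEIGHTS[token_id]
        logits[0] += neg
        logits[1] += pos
    return logits


def post_process(logits):
    """Post-processing: softmax the logits and pick the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(texts):
    """Run every stage in order for each input text."""
    return [post_process(infer(tokenize(text))) for text in texts]


results = toy_pipeline(["great movie", "bad plot"])
print(results)
```

The returned dictionaries have the same `label` and `score` keys as the sentiment-analysis output of the real pipeline, which makes the role of the post-processing step easier to see.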



### Tasks and Pretrained Models

See https://huggingface.co/tasks for the full list of tasks and pretrained models.


In [1]:
# Importing necessary libraries
from transformers import pipeline

In [2]:
# sentiment analyzer
# Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text. 
# It involves analyzing text data to identify and classify opinions, emotions, attitudes, or sentiments conveyed by the author. 
classifier = pipeline("sentiment-analysis",
                      model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", 
                      revision="af0f99b")

results = classifier(
    [
        "this is a positive thing to do.",
        "wat you did was not a good thing to do at all!....",
        "what you did wasn't bad."
    ]
)

results




[{'label': 'POSITIVE', 'score': 0.9998718500137329},
 {'label': 'NEGATIVE', 'score': 0.9997977614402771},
 {'label': 'POSITIVE', 'score': 0.9979287385940552}]

In [3]:
# text generator

generator = pipeline("text-generation",
                     model="openai-community/gpt2", 
                     revision="6c0e608")

results = generator([
    "Once upon a time",
    "He was a good guy"
], max_length=25, truncation=True)

results

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': 'Once upon a time, it became quite clear.\n\n"She\'s a young girl. She\'s a woman of twenty'}],
 [{'generated_text': 'He was a good guy," White said in his statement.\n\nWhite was a teammate of New York Giants coach Bruce B'}]]

In [4]:
# text summarizer

summarizer = pipeline("summarization",
                     model="sshleifer/distilbart-cnn-12-6", 
                     revision="a4f8f3e")

# Read some long text to summarize
with open('support_files/long_text.txt', 'r') as file:
    contents = file.read()
print([{"orignal":contents}])

result = summarizer(contents, max_length=56)

result

[{'orignal': "Large Language Models (LLMs) represent a groundbreaking advancement in natural language processing (NLP), revolutionizing the way machines understand and generate human-like text. These models, powered by deep learning algorithms and massive amounts of training data, have demonstrated remarkable capabilities in various NLP tasks, including text generation, translation, sentiment analysis, and more. At the forefront of LLMs are architectures like OpenAI's GPT (Generative Pre-trained Transformer) series, Google's BERT (Bidirectional Encoder Representations from Transformers), and other transformer-based models."}]


[{'summary_text': ' Large Language Models (LLMs) represent a groundbreaking advancement in natural language processing (NLP) These models are powered by deep learning algorithms and massive amounts of training data . LLMs have demonstrated remarkable capabilities in various NLP tasks, including text generation, translation, sentiment'}]

### Transformers Tokenizers


Tokenizers are essential in NLP for preprocessing text, segmenting words, standardizing representations, managing vocabularies, handling special tokens, utilizing subword tokenization, and ensuring computational efficiency. They play a crucial role in transforming raw text into a format suitable for analysis and input into NLP models.

We can use pretrained tokenizers or train our own.

In [5]:
# Using a pretrained tokenizer
from transformers import AutoTokenizer

# AutoTokenizer will automatically select the appropriate tokenizer based on the model name
# We will use bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [6]:
# some text we want to create tokens for 
text = "This is a smaple text :)"

# Tokenize the input text
encoded_input = tokenizer.encode(text, add_special_tokens=True)

encoded_input

[101, 2023, 2003, 1037, 15488, 9331, 2571, 3793, 1024, 1007, 102]

In [7]:
# we can decode the tokens using decode function
decoded_text = tokenizer.decode(encoded_input, skip_special_tokens=True)

decoded_text

'this is a smaple text : )'
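
The ID list above contains three IDs where the misspelled word `smaple` stands, which suggests the word was split into subword pieces. BERT tokenizers use WordPiece, which splits a word by greedy longest-match-first against the vocabulary. The sketch below illustrates that idea under stated assumptions: `wordpiece_split` and `toy_vocab` are hypothetical names, and the toy vocabulary was chosen by hand, so the pieces it produces are not claimed to be the ones bert-base-uncased actually uses.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen by hand for this example.
toy_vocab = {"this", "is", "a", "text", "sm", "##ap", "##le", "sample"}

print(wordpiece_split("sample", toy_vocab))
print(wordpiece_split("smaple", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

A correctly spelled word that is in the vocabulary stays whole, while a misspelling falls back to shorter pieces, and a word with no matching piece at all becomes the unknown token.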

### Creating our own tokenizer

To create our own tokenizer we will use Byte Pair Encoding (BPE).

It is a popular subword tokenization technique used in natural language processing (NLP) tasks.

In BPE, the input text is segmented into variable-length subword units. The segmentation is performed iteratively by merging the most frequent pairs of adjacent characters or character sequences. This process continues until a predefined vocabulary size is reached or until the desired number of merge operations is completed.

BPE effectively handles rare words, out-of-vocabulary words, and morphologically rich languages by allowing the model to learn subword representations that can capture meaningful linguistic patterns. It has been widely adopted in various NLP applications, including machine translation, text generation, and sentiment analysis, among others, due to its flexibility and effectiveness in handling different types of text data.
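
The merge loop described above can be sketched in a few lines of plain Python. This is a minimal teaching version, not the implementation inside the tokenizers library: `train_bpe` is a hypothetical helper, the word list is made up, and ties between equally frequent pairs are broken by first occurrence, which may differ from what a production trainer does.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Made-up word list; repeated words stand in for corpus frequencies.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = train_bpe(words, num_merges=3)
print(merges)
print(sorted(corpus))
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges controls the final vocabulary size.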

Refer to https://huggingface.co/docs/tokenizers/en/components#models for more information on other models.

In [8]:
# We will import Tokenizer BPE and other necessary imports
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Create the tokenizer
# We use BPE with unk_token="[UNK]" to specify the token that stands in for unknown words.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer

<tokenizers.Tokenizer at 0x1a458f3c150>

In [9]:
# we will now create a trainer to train our tokenizer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# set pre_tokenizer to whitespace to tokenize text based on whitespaces
tokenizer.pre_tokenizer = Whitespace()

trainer

<tokenizers.trainers.BpeTrainer at 0x1a45acab570>
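
To get a feel for what the `Whitespace` pre-tokenizer does, here is a rough plain-Python approximation. The tokenizers documentation describes it as splitting with the regex `\w+|[^\w\s]+`; the helper name `pre_tokenize` is hypothetical, and Python's `re` module may treat some Unicode characters differently from the library's own regex engine, so treat this as a sketch rather than an exact equivalent.

```python
import re

# Pattern documented for the Whitespace pre-tokenizer: runs of word
# characters, or runs of characters that are neither word chars nor spaces.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs for each match in the text."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = pre_tokenize("Hello, world!")
print(pieces)
```

Each piece carries the character offsets of where it came from in the original string, so a piece can always be mapped back to the text that produced it.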

In [10]:
# We will download a dataset from Hugging Face to train our tokenizer model
from datasets import load_dataset

# we will use wikitext-2-raw-v1
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1")

# The dataset has test, train, and validation splits
dataset

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [11]:
# Dataset sample
dataset["train"][:3]["text"], dataset["test"][:3]["text"], dataset["validation"][:3]["text"]

(['', ' = Valkyria Chronicles III = \n', ''],
 ['', ' = Robert Boulter = \n', ''],
 ['', ' = Homarus gammarus = \n', ''])

In [12]:

import os

# Save the dataset to files, since tokenizer.train takes two arguments: files and trainer
dataset_path = "support_files/temp/dummy_dataset"
os.makedirs(dataset_path, exist_ok=True)

# Write each split to its own raw text file ("w" avoids duplicating data on re-runs)
for split in dataset.keys():
    with open(f"{dataset_path}/{split}.txt", "w", encoding="utf-8") as f:
        for example in dataset[split]:
            f.write(example["text"])

In [13]:
# Collect the file paths and train the tokenizer
files = [f"{dataset_path}/{split}.txt" for split in ["test", "train", "validation"]]
tokenizer.train(files, trainer)

In [14]:
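# Save the trained tokenizer to a file (create the target directory first)
import os
tokenizer_file_path = "./trained_data/tokenizer.json"
os.makedirs(os.path.dirname(tokenizer_file_path), exist_ok=True)
tokenizer.save(tokenizer_file_path)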

# Now to load our trained tokenizer
tokenizer = Tokenizer.from_file(tokenizer_file_path)

outputs = tokenizer.encode("Fist custom trained from scratch tokenizer model 😁")

outputs.ids, outputs.tokens

([42, 1247, 8310, 8035, 1250, 16599, 1176, 3198, 14382, 6114, 0],
 ['F',
  'ist',
  'custom',
  'trained',
  'from',
  'scratch',
  'to',
  'ken',
  'izer',
  'model',
  '[UNK]'])

In [15]:
# Use the last token's offsets to slice the matching text out of the original sentence
"Fist custom trained from scratch tokenizer model 😁"[int(outputs.offsets[-1][0]): int(outputs.offsets[-1][1])]

'😁'

In [16]:
# Look up the ID of a special token
tokenizer.token_to_id("[SEP]")

# Set the post-processor so that special tokens are added automatically, giving us the traditional BERT inputs
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

output = tokenizer.encode("Fist custom trained from scratch tokenizer model 😁")
output.tokens

['[CLS]',
 'F',
 'ist',
 'custom',
 'trained',
 'from',
 'scratch',
 'to',
 'ken',
 'izer',
 'model',
 '[UNK]',
 '[SEP]']
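
To make the template syntax more concrete, the sketch below expands a template by hand in plain Python. It is an approximation written for this notebook, not the library's code: `apply_template` is a hypothetical helper, the word IDs are made up, and the special token IDs assume `[CLS]` is 1 and `[SEP]` is 2, matching the order the special tokens were given to the trainer. In a template, `$A` and `$B` stand for the first and second sequence, and an optional `:N` suffix sets the type ID (0 when omitted).

```python
def apply_template(template, sequences, special_ids):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1".

    `sequences` maps "A" and "B" to lists of (token, id) pairs.
    Returns parallel lists of tokens, ids, and type ids.
    """
    tokens, ids, type_ids = [], [], []
    for part in template.split():
        # An optional ":N" suffix sets the type id; the default is 0.
        name, _, type_id = part.partition(":")
        type_id = int(type_id) if type_id else 0
        if name.startswith("$"):
            items = sequences[name[1:]]  # "$A" -> sequences["A"]
        else:
            items = [(name, special_ids[name])]  # a special token
        for token, token_id in items:
            tokens.append(token)
            ids.append(token_id)
            type_ids.append(type_id)
    return tokens, ids, type_ids


# Assumed special token IDs and made-up word IDs, for illustration only.
special_ids = {"[CLS]": 1, "[SEP]": 2}
sequences = {"A": [("hello", 10), ("world", 11)], "B": [("bye", 12)]}

tokens, ids, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1", sequences, special_ids
)
print(tokens)
print(ids)
print(type_ids)
```

The type IDs are what let a BERT-style model tell the first sentence of a pair from the second.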