# Replicating the `pipeline()` Function

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "It is really hot here in Italy and going outside means being drenched in your own sweat and it does not feel good.",
        "The weather in my hometown during summer is really nice!",
    ]
)

When we run the above cell, we obtain
```python
[{'label': 'NEGATIVE', 'score': 0.9997081160545349},
 {'label': 'POSITIVE', 'score': 0.9998784065246582}]
```
Under the hood, the `pipeline()` function carries out every task necessary to use the specified model for prediction (here the default model, *distilbert-base-uncased-finetuned-sst-2-english*). The steps are *preprocessing (tokenisation)*, *passing the inputs through the model*, and finally *postprocessing*:
<br />
<br />
![Pipeline Steps](data/chapter_2/pipeline_steps.png "The Three Steps!")

In [None]:
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

### Preprocessing with a tokenizer

Every model has its own way of splitting the input text into tokens, mapping them to integers, and adding any additional inputs the model needs. Therefore, when fine-tuning a pretrained model or using it for inference, the tokenisation must follow the same rules that were used when the model was pretrained. The `from_pretrained()` method of the `AutoTokenizer` class takes care of this by loading the tokeniser that belongs to a given checkpoint.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Now that we have the right tokeniser, we use it to preprocess the raw input and pass the tensors it returns to the model (in simple terms, a tensor is a multi-dimensional array in which every row must have the same length). In the example below, we ask the tokeniser to tokenise the raw input and apply:
- *padding* - adding padding tokens, which the model is told to ignore, to the shorter sequences so that all sequences in the batch have equal length,
- *truncation* - cutting off the tokens that exceed the maximum input length of the model, and
- *return type* - the type of tensor to return (`return_tensors='pt'`, where `pt` means *PyTorch tensor*).

> Note: whether you need *padding* and *truncation* mainly depends on the type of object you ask the tokeniser to return. If you want a tensor, setting *padding* and *truncation* is the right choice because tensors only accept rectangular shapes (think matrices).
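
To make padding and truncation concrete, here is a minimal pure-Python sketch of the idea. It is not the tokeniser's implementation: the helper `pad_and_truncate`, the token ids and the pad id `0` are assumptions made up for illustration, and a real tokeniser also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Return equally long id lists plus a matching attention mask."""
    # truncation: drop everything beyond max_length
    truncated = [ids[:max_length] for ids in batch]
    # padding: extend every sequence to the longest one left in the batch
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# hypothetical token ids for three sequences of different lengths
toy_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102], [101, 7, 8, 102]]
padded = pad_and_truncate(toy_batch, max_length=5)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

After the call every row has the same length, which is exactly the rectangular shape a tensor requires.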


In [None]:
raw_inputs = [
    "It is really hot here in Italy and going outside means being drenched in your own sweat and it does not feel good.",
    "The weather in my hometown during summer is really nice!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
print(inputs)

As we can see, the output is a dictionary with two keys, `input_ids` and `attention_mask`:
- *input_ids*: the numerical representations of the tokens. Here it contains two rows of integers (one for each sentence).
- *attention_mask*: tells the model which tokens it should attend to (1) and which are padding to be ignored (0).

> Note: some tokenisers (for example BERT's) also return a third key, *token_type_ids*, which tells the model which part of the input is sentence A and which is sentence B. The DistilBERT tokeniser used here does not return it.

### Going through the model
This time we use the `AutoModel` class and load the model with its `from_pretrained()` method. Loaded through the basic *AutoModel* class, the model only returns *hidden states* (also called *features*): high-dimensional vectors that represent the Transformer's contextual understanding of the input. These are not yet quantities we can interpret directly, because a model loaded this way lacks the task-specific head that converts the hidden states into such quantities.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_checkpoint)

The returned *hidden states* generally have three dimensions:
- Batch size: the number of sequences processed at a time (2 in our example).
- Sequence length: the number of tokens in each sequence after padding (the second value of the shape printed below).
- Hidden size: the dimension of the vector that the model outputs for each token.

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

### Model heads: Making sense out of numbers
A model head takes the *hidden states* output by the model (each vector here has size *768*) and projects them onto a different dimension that can be interpreted for the task at hand.
<br />
<br />
![A Model Workflow](data/chapter_2/model_workflow.png "A Model Workflow")
>In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.

Now, instead of the *AutoModel* class, we will use *AutoModelForSequenceClassification*, which comes with a sequence classification head.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)

Because the model now comes with a head, the dimensionality of the output is much lower: each output vector contains two values (one per label, i.e., NEGATIVE and POSITIVE). Since we have just two sentences and two labels, the result we get from our model has shape 2 x 2.
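
As a rough illustration of what a head does, the sketch below projects one hidden vector onto two scores with a single linear layer. This is a simplification under stated assumptions: the helper `linear_head`, the hidden size of 4 (instead of 768) and all numbers are invented for the example, and the head shipped with a real checkpoint may contain more layers than this.

```python
def linear_head(hidden_vector, weights, bias):
    """Project one hidden vector onto len(weights) output scores (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# toy numbers: hidden size 4 instead of 768, two labels
hidden_vector = [0.5, -1.0, 2.0, 0.0]
weights = [
    [1.0, 0.0, -1.0, 0.5],   # row producing the score for label 0
    [-1.0, 0.5, 1.0, 0.0],   # row producing the score for label 1
]
bias = [0.1, -0.1]

logits = linear_head(hidden_vector, weights, bias)
print(len(hidden_vector), "->", len(logits))
print(logits)
```

The point of the sketch is only the change in dimensionality: a vector of hidden size goes in, one score per label comes out.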

### Postprocessing The Output

In [None]:
print(outputs.logits)

However, the model's output is still not directly interpretable, because the values are *logits* (raw, unnormalised scores). To convert the *logits* into probabilities we pass them through a normalising function, here **softmax**.

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Now we can see that the model predicted roughly `[0.9997, 0.0003]` for the first sentence and `[0.0001, 0.9999]` for the second one, which matches the scores returned by `pipeline()` at the start; these are recognisable probability scores. (PyTorch may print them in scientific notation, e.g. `9.9971e-01`, which is 0.9997 and not 0.9971.) To get the label corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [None]:
model.config.id2label

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.9997, POSITIVE: 0.0003 -> NEGATIVE
- Second sentence: NEGATIVE: 0.0001, POSITIVE: 0.9999 -> POSITIVE
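
The last postprocessing step, picking the most probable label, can be sketched in plain Python. The `id2label` mapping follows the positions described above (0 is NEGATIVE, 1 is POSITIVE); the probabilities are illustrative values rounded from the `pipeline()` scores at the start of this section, with the complementary value assumed to be one minus the score.

```python
# label mapping as described above: position 0 is NEGATIVE, position 1 is POSITIVE
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# rounded probabilities, one row per sentence (illustrative values)
probabilities = [
    [0.9997, 0.0003],
    [0.0001, 0.9999],
]

results = []
for row in probabilities:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    results.append({"label": id2label[best], "score": row[best]})

print(results)
```

The resulting list has the same structure as the output of `pipeline()`: one dictionary with a `label` and a `score` per sentence.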

## Model

Instead of going through `AutoModel`:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")
```

we can also load the model directly from its own class if we know which architecture the checkpoint uses.
Furthermore, we can save a model on our local machine by calling its `save_pretrained()` method.

In [None]:
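from transformers import AutoTokenizer, BertModel

# load BERT directly from its own class instead of going through AutoModel
model = BertModel.from_pretrained("bert-base-cased")

# tokenise with the matching checkpoint: the earlier `inputs` came from the
# DistilBERT (uncased) tokeniser, whose ids do not belong to this vocabulary
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert_inputs = bert_tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')

outputs = model(**bert_inputs)

# save the model
model.save_pretrained('data/chapter_2/bert_model')

print(outputs.last_hidden_state.shape)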

Two files are saved in the given directory, *config.json* and *model.safetensors*, which together define the saved model:
- *config.json* - It stores all the necessary attributes needed to build the model architecture.
- *model.safetensors* - It's the state dictionary; it contains all your model’s weights.


Furthermore, we can load the saved model again by calling the `from_pretrained()` method of `AutoModel` with that directory:

In [None]:
model = AutoModel.from_pretrained('data/chapter_2/bert_model')

# Tokenisation: Encoding Text

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
encoded_input = tokenizer("How are you?", "How's your holiday going?")
print(encoded_input)

Here we can see that when we encode two sentences together, *token_type_ids* is also returned to specify which token id belongs to which sentence. This is needed because the two sentences are merged into one sequence: if we decode the input ids, we get back a single long sentence containing the special tokens `[CLS]` and `[SEP]`. These special tokens are model specific and mostly tell the model where a sequence starts and ends.

In [None]:
tokenizer.decode(encoded_input["input_ids"])
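
How the two sentences end up in one sequence can be sketched by hand. The helper `build_pair` and the hand-written tokens are assumptions for illustration; the layout (`[CLS]` first, `[SEP]` after each sentence, type id 0 for the first segment and 1 for the second) follows the BERT convention.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark which sentence each token belongs to."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    return {
        "tokens": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }


# hand-written tokens, not produced by a real tokeniser
pair = build_pair(
    ["How", "are", "you", "?"],
    ["How", "'", "s", "your", "holiday", "going", "?"],
)
for token, type_id in zip(pair["tokens"], pair["token_type_ids"]):
    print(f"{type_id}  {token}")
```

Without the type ids, nothing in the flat sequence would tell the model where the first sentence ends and the second begins, apart from the `[SEP]` token itself.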

Just as we can load a known model directly from its class, we can load a tokeniser without going through the `AutoTokenizer` class. We can likewise call `save_pretrained()` to save the tokeniser.

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

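# named `encoded_sentence` rather than `input` to avoid shadowing the built-in function
encoded_sentence = tokenizer("I stay till late in the library because it is open till late and I am not so productive when staying back in my room.")

tokenizer.save_pretrained("data/chapter_2/tokeniser")

print(encoded_sentence)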

# Tokeniser Pipeline
By now we also know what happens when we call `tokenizer()` to convert text into a machine-readable input format. The tokeniser takes the raw text, breaks it down into tokens (depending on the tokeniser's vocabulary), adds any special tokens the model expects, and finally converts the tokens into input ids.
<br />
<br />
!["Tokeniser Workflow"](data/chapter_2/tokeniser_workflow.png  "The Tokeniser Workflow")

### Splitting text into tokens
We can use the `tokenize()` method of a tokeniser to convert the raw text into tokens from that tokeniser's vocabulary.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [None]:
tokens = tokenizer.tokenize("How was your day today and how's the weather there?")
print(tokens)
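
To give an intuition for how a vocabulary drives the splitting, here is a minimal sketch of greedy longest-match-first sub-word splitting in the WordPiece style that BERT tokenisers use. The function `wordpiece` and the tiny `toy_vocab` are assumptions for illustration, and the sketch expects text that is already lower-cased and split on spaces, which a real tokeniser does for us.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical vocabulary, far smaller than a real one
toy_vocab = {"how", "was", "your", "day", "to", "##day", "weather", "##s", "there"}
sentence = "how was your day today weathers"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

Words that are in the vocabulary stay whole, while the others are split into the longest pieces the vocabulary offers.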

Different tokenisers follow different rules and conventions when converting raw text into tokens. In the example below we can see that `albert-base-v1` puts a '▁' (a character that resembles an underscore) in front of every token that starts a new word.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('albert-base-v1')
tokens = tokenizer.tokenize("You know in Florence, in the summer it is really humid!")
print(tokens)

### Token mapping
To convert the tokens to their respective token ids, we can use the `convert_tokens_to_ids()` method of the same tokeniser that produced the tokens.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Can't think of an example or what to ask")
print(tokens)
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print(input_ids)
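
Conceptually, this mapping is a dictionary lookup. The sketch below uses a made-up vocabulary in which a token's id is its position in the list; `tokens_to_ids` and `ids_to_tokens` are hypothetical helpers, not the library's methods.

```python
# hypothetical vocabulary: a token's id is simply its position in the list
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "can", "'", "t", "think", "of", "an", "example"]
token_to_id = {token: index for index, token in enumerate(toy_vocab)}


def tokens_to_ids(tokens, unk_token="[UNK]"):
    """Look every token up in the vocabulary, falling back to the unknown id."""
    return [token_to_id.get(token, token_to_id[unk_token]) for token in tokens]


def ids_to_tokens(ids):
    """Inverse lookup: ids back to tokens."""
    return [toy_vocab[i] for i in ids]


ids = tokens_to_ids(["can", "'", "t", "think", "of", "an", "example", "zebra"])
print(ids)
print(ids_to_tokens(ids))
```

A token that is missing from the vocabulary ("zebra" here) maps to the id of `[UNK]`, so converting back does not recover the original word.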

We can see that the special tokens are still missing. We can add them by calling the `prepare_for_model()` method of the tokeniser:

In [None]:
finalised_input = tokenizer.prepare_for_model(input_ids)
print(finalised_input)

decoded_text = tokenizer.decode(finalised_input['input_ids'])
print(decoded_text)

# Handling Multiple Sequences

When we use a Transformer model, the input we pass to the model must be a batch (i.e., a list of sequences), no matter how many text sequences it actually has to process. This is because the model is built to accept only batches of inputs.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = 'The weather after midnight is really nice!'

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
ids = tokenizer.prepare_for_model(ids) 
input_ids = torch.tensor(ids['input_ids'])

print(input_ids)

# this call fails: the model receives a 1-D tensor but expects a batch (2-D tensor)
model(input_ids)

In the code above, the final `model(input_ids)` call failed because the model expects an input like `tensor([[input ids, ..], ....])` (i.e., a batch of sequences) but received a one-dimensional `tensor([input ids, ..])`. Therefore we need to put the input ids inside a list first, convert that list to a torch tensor, and then pass it to the model.
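
The difference between the two shapes can be seen without any model. The helper `shape` and the ids below are assumptions for illustration; the sketch only shows how wrapping a sequence in a list adds the batch dimension.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


# hypothetical ids for one sequence
single_sequence = [101, 1996, 4633, 102]

print(shape(single_sequence))                     # one dimension: sequence length only
print(shape([single_sequence]))                   # batch of one sequence
print(shape([single_sequence, single_sequence]))  # batch of two sequences
```

The model expects the second or third form: the first dimension counts the sequences in the batch and the second counts the tokens per sequence.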

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = 'The weather after midnight is really nice!'

tokens = tokenizer.tokenize(sequence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
ids = tokenizer.prepare_for_model(token_ids) 
input_ids = torch.tensor([ids['input_ids']])

print(input_ids)
model(input_ids)

Now we can add more sequences of input ids to our batch tensor, and the model will make predictions for them as well.

In [None]:
input_ids = torch.tensor(
    [ids['input_ids'], ids['input_ids']]
)

print(input_ids)
model(input_ids)

If you remember, earlier we also talked about padding and truncation. To find out which padding token to add to shorter inputs and how long a sequence may be before it is truncated, we can query the tokeniser: `tokenizer.pad_token_id`, `tokenizer.special_tokens_map['pad_token']` and `tokenizer.max_len_single_sentence`. Remember that when we pad we also have to pass the attention mask, so that the model knows not to attend to the padding tokens.

In [None]:
# use the id of the [PAD] token itself (`pad_token_type_id` is the segment id given to padding, not the token id)
padding_id = tokenizer.pad_token_id
padding_token = tokenizer.special_tokens_map['pad_token']
max_length = tokenizer.max_len_single_sentence
print(f"padding token '{padding_token}' and max length per sequence '{max_length}'")

For simplicity, the example below only covers padding (up to the longest sequence in the batch), truncation (to the maximum length the tokeniser allows) and an attention mask that only accounts for the `[PAD]` token.

In [None]:
sequences = ['The weather after midnight is really nice!', 'How are you?']

# converting to tokens
sequences_tokens = [tokenizer.tokenize(sequence) for sequence in sequences]

# truncate based on the allowed max length (counted in tokens, not characters)
sequences_tokens = [sequence_tokens[:max_length] for sequence_tokens in sequences_tokens]
print(f"Tokens:\n{sequences_tokens}")

# add padding based on the longest token sequence
max_token_length_sequence = len(max(sequences_tokens, key=len))
for sequence_tokens in sequences_tokens:
    num_of_pad_token = max_token_length_sequence - len(sequence_tokens)
    sequence_tokens.extend([padding_token] * num_of_pad_token)

print(f"Padded tokens: {sequences_tokens}")

# converting to token ids
sequences_token_ids = [tokenizer.convert_tokens_to_ids(sequence_token) for sequence_token in sequences_tokens]
print(f"Token ids :\n{sequences_token_ids}")

# adding special tokens and getting the attention mask
inputs = [tokenizer.prepare_for_model(sequence_token_ids) for sequence_token_ids in sequences_token_ids]

print(f"Added special tokens and attention mask:\n{inputs}")

If we look at the attention mask of the second sequence, it has 1 for all the token ids. That is wrong, because we know we added some padding ids there: instead of `[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]` it should be `[1, 1, 1, 1, 1, 0, 0, 0, 0, 1]`. This is one drawback of doing all of this manually: you also have to create the attention mask yourself while keeping the special tokens in mind.

In [None]:
# fixing attention mask
for i in range(len(inputs)):
    attention_mask = []
    for input_id in inputs[i]['input_ids']:
        if input_id == padding_id:
            attention_mask.append(0)
        else:
            attention_mask.append(1)

    # update the attention_mask
    inputs[i]['attention_mask'] = attention_mask


print(f"Updated inputs:\n {inputs}")

    

In [None]:
batched_ids = torch.tensor([item['input_ids'] for item in inputs])
attention_masks = torch.tensor([item['attention_mask'] for item in inputs])

outputs = model(batched_ids, attention_mask=attention_masks)
print(f"Logits:\n\t{outputs.logits}\n\n")

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Prediction {model.config.id2label}:\n\t {predictions}")

So, with the probabilities roughly rounded:

| Sequence | Negative | Positive | Prediction |
|----------|----------|----------|------------|
|    1     |   0.001  |    0.9   | Positive   |
|    2     |   0.9    |    0.02  | Negative   |

# Putting It All Together

Now that we have seen all the individual steps, we don't actually need to remember them: the *transformers* tokeniser handles all of this for us when we call it directly instead of calling its individual methods one by one.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ['The weather after midnight is really nice!', 'How are you?']

# padding can be True/"longest", or "max_length" combined with a separate max_length parameter that takes an int
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors='pt')
output = model(**tokens)
print(output.logits)

prediction = torch.nn.functional.softmax(output.logits, dim=-1)
print(prediction)

> Note: in `torch.nn.functional.softmax(output.logits, dim=-1)`, specifying `dim=-1` tells softmax to normalise over the last dimension of the tensor, i.e. separately over each row of logits, which are `[-4.1945,  4.5471]` and `[1.9830, -1.5163]` in our example above.
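
What the softmax does to each row can be reproduced with the standard library alone. The sketch below is a plain-Python stand-in for `torch.nn.functional.softmax` (the helper `softmax` is written for this example, not taken from PyTorch) and uses the two rows of logits quoted in the note.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


# the two rows of logits quoted in the note above
logits = [[-4.1945, 4.5471], [1.9830, -1.5163]]

# dim=-1 corresponds to normalising each row on its own
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

Each row is normalised independently, so the probabilities of one sentence never influence those of the other.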

# Optimized Inference Deployment

Now that we know the fundamentals of how an LLM works, let's have a quick overview of how you can deploy these models in a production environment so that users can run inference with them.

There are three advanced frameworks for optimising LLM deployments:
- **Text Generation Inference** (**TGI**)
- **vLLM**
- **llama.cpp**

|           |  Memory Management and Performance  | Deployment and Integration  |
|-----------|-------------------------------------|-----------------------------|
| *TGI*       | Uses **Flash Attention 2**, optimizing memory and speed, suitable for fast inference. | Built-in production-ready monitoring and security, stable and secure deployments.  |
| *vLLM*      | Uses **PagedAttention**, enabling efficient memory management for faster inference on large models. | Highly customizable and developer-friendly, good for scenarios requiring adaptability and optimization.  |
| *llama.cpp* | Utilizes **quantization techniques**, making it suitable for resource-limited environments. | Focuses on simplicity, ease of use, and portability, ideal for lightweight, diverse deployments. |

If you would like to see how to deploy using these frameworks, have a look at [HuggingFace - Optimized Inference Deployment](https://huggingface.co/learn/llm-course/chapter2/8?fw=pt).

# The End!