# Replicating the pipeline() Function

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "It is really hot here in Italy and going outside means being drenched in your own sweat and it does not feel good.",
        "The weather in my hometown during summer is really nice!",
    ]
)

Running the cell above produces:
```
[{'label': 'NEGATIVE', 'score': 0.9997081160545349},
 {'label': 'POSITIVE', 'score': 0.9998784065246582}]
```
Under the hood, the `pipeline()` function performs all the steps necessary to run the specified model on the input (here the default model, *distilbert-base-uncased-finetuned-sst-2-english*). There are three steps: *preprocessing (tokenisation)*, *passing the inputs through the model*, and *postprocessing*:
<br />
<br />
![Pipeline Steps](data/chapter_2/pipeline_steps.png "The Three Steps!")

In [None]:
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

### Preprocessing with a tokenizer

Every model has its own way of splitting the input text into tokens, mapping those tokens to integers, and adding any extra inputs the model needs. Therefore, when fine-tuning a pretrained model or running inference with it, the tokenisation must follow the same rules that were used when the model was trained. The `from_pretrained()` method of the `AutoTokenizer` class takes care of this by loading the tokeniser that belongs to a given checkpoint.
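To make the "tokens to integers" idea concrete, here is a deliberately simplified sketch. The vocabulary `TOY_VOCAB` and the function `toy_encode` are made up for illustration; real tokenisers use learned subword vocabularies with tens of thousands of entries, so this is not how `AutoTokenizer` works internally.

```python
# Toy illustration only: a hypothetical eight-entry vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "weather": 5, "is": 6, "nice": 7,
}


def toy_encode(text: str) -> list:
    """Lowercase, split on whitespace, add special tokens, and map to IDs."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the unknown token.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


toy_ids = toy_encode("The weather is really nice")
print(toy_ids)  # [2, 4, 5, 6, 1, 7, 3]
```

The word "really" is not in the toy vocabulary, so it maps to the `[UNK]` ID. A different vocabulary would assign different integers to the same text, which is why the tokeniser and the model must come from the same checkpoint.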

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Now that we have the right tokeniser, we use it to preprocess the raw input and pass the tensors it returns to the model (put simply, a tensor is a multi-dimensional array whose rows must all have the same length). In the example below, we ask the tokeniser to tokenise the raw input with the following options:
- *padding* - append padding tokens, which the model is told to ignore, to the shorter texts so that all inputs in the batch have the same length,
- *truncation* - cut off the tokens that exceed the maximum input length of the model, and
- *return type* - the type of tensor to return (`return_tensors='pt'`, where `pt` means *PyTorch tensor*).

> Note: whether you need *padding* and *truncation* depends mainly on the type of object you ask the tokeniser to return. If you want a tensor, setting both is the right choice: padding is required when the sentences differ in length, because tensors only accept rectangular shapes (think matrices), and truncation keeps every input within the model's length limit.
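The effect of padding and truncation can be sketched by hand on plain lists of token IDs. The helper `pad_and_truncate` and the IDs below are hypothetical and only illustrate the idea; the real tokeniser handles details this sketch ignores, such as keeping the final special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Return equal-length ID rows plus a mask marking the real tokens."""
    # Truncation: drop anything beyond the maximum length.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every row to the length of the longest remaining row.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for three "sentences" of different lengths.
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 1, 2, 3, 4, 5, 6, 102]]
padded = pad_and_truncate(toy_batch, max_length=6)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After this step every row has six entries, so the batch is rectangular and could be turned into a tensor.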


In [None]:
raw_inputs = [
    "It is really hot here in Italy and going outside means being drenched in your own sweat and it does not feel good.",
    "The weather in my hometown during summer is really nice!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
print(inputs)

As we can see, the output is a dictionary with two keys, `input_ids` and `attention_mask`:
- *input_ids* - the numerical representations of the tokens; here, two rows of integers (one per sentence), and
- *attention_mask* - tells the model which tokens it should attend to (1) and which are padding to be ignored (0).

> Note: some tokenisers (for example BERT's, used later on) return a third key, *token_type_ids*, which tells the model which part of the input is sentence A and which is sentence B.

### Going through the model
This time we use the `AutoModel` class and its `from_pretrained()` method to load the model. Loaded this way, the model returns only *hidden states* (also called *features*): high-dimensional vectors that represent the Transformer's contextual understanding of the input. They are not yet a quantity we can interpret directly, because the bare model lacks the task-specific head that converts the hidden states into one.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_checkpoint)

The returned *hidden states* generally have three dimensions:
- Batch size: the number of sequences processed at a time (2 in our example).
- Sequence length: the number of tokens in each sequence after padding, i.e., the token count of the longest sentence in the batch including special tokens (the exact value appears in the shape printed below).
- Hidden size: the dimension of the vector the model produces for each token.

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

### Model heads: Making sense out of numbers
A model head takes the *hidden states* produced by the base model (each a vector of size *768* in this case) and projects them onto a different dimension that can be interpreted for the task at hand.
<br />
<br />
![A Model Workflow](data/chapter_2/model_workflow.png "A Model Workflow")
>In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
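As a rough sketch of what "projecting onto a different dimension" means, the simplest possible head is a single linear layer that computes one score per label from one hidden vector. The function `linear_head` and all numbers below are hypothetical (a hidden size of 4 instead of 768), and real classification heads may contain more than one layer.

```python
def linear_head(hidden_vector, weights, bias):
    """Project one hidden vector onto one logit per label: logits = W.h + b."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical numbers: hidden size 4 and 2 labels.
hidden_vector = [1.0, -2.0, 0.5, 3.0]
weights = [
    [1.0, 0.0, 2.0, 0.0],  # row producing the logit for label 0
    [0.0, 1.0, 0.0, 1.0],  # row producing the logit for label 1
]
bias = [0.5, -0.5]

toy_logits = linear_head(hidden_vector, weights, bias)
print(toy_logits)  # [2.5, 0.5]
```

The four-dimensional input becomes a two-dimensional output, one value per label, which is the same reduction the sequence classification head performs at a larger scale.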

Now, instead of the *AutoModel* class, we will use *AutoModelForSequenceClassification*, which comes with a sequence classification head.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)

Because the model now comes with a head, the dimensionality of the output is much lower: each output vector contains just two values (one per label, i.e., NEGATIVE and POSITIVE). Since we have two sentences and two labels, the result we get from our model is of shape 2 x 2.

### Postprocessing The Output

In [None]:
print(outputs.logits)

However, the model's output is still not directly interpretable, because the values are *logits* (raw, unnormalised scores). To convert the logits into probabilities, we pass them through a function that normalises them, for example **softmax**.
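For intuition, softmax can be written out by hand: exponentiate each logit and divide by the sum of the exponentials. The sketch below uses only the standard library and made-up logits; it illustrates the formula and is not a replacement for the PyTorch function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_probs = softmax([2.0, 0.0])  # made-up logits for two labels
print(toy_probs)
print(sum(toy_probs))
```

The larger logit receives the larger probability, and the two values add up to 1 (up to floating-point rounding).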

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

We can now see that the model predicted approximately `[0.9997, 0.0003]` for the first sentence and `[0.0001, 0.9999]` for the second one. These are recognisable probability scores, and they match the scores returned by `pipeline()` at the start of this section. To get the label corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [None]:
model.config.id2label

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.9997, POSITIVE: 0.0003 -> NEGATIVE
- Second sentence: NEGATIVE: 0.0001, POSITIVE: 0.9999 -> POSITIVE
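This last step, picking the most probable label for each sentence, can be sketched in plain Python. The probabilities below are rounded from the `pipeline()` scores shown at the start of this section, and the `id2label` dictionary is written out by hand as an assumption; in practice it should be read from `model.config.id2label`.

```python
# Assumed mapping; check `model.config.id2label` for the actual one.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities rounded from the pipeline scores at the start of this section.
rounded_probs = [[0.9997, 0.0003], [0.0001, 0.9999]]

results = []
for probs in rounded_probs:
    # Index of the highest probability in this row.
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The resulting list has the same structure as the output of `pipeline()`: one dictionary per sentence with a `label` and a `score`.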

## Model

Instead of going through `AutoModel`, as in:
```
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")
```

we can also load the model directly from its architecture class if we know which one the checkpoint uses.
Furthermore, we can save a model on our local machine by calling its `save_pretrained()` method.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

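# Re-tokenise with the matching BERT tokeniser: `inputs` above was built with the
# DistilBERT uncased vocabulary, whose token IDs do not correspond to bert-base-cased.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert_inputs = bert_tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

outputs = model(**bert_inputs)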

# save the model
model.save_pretrained('data/chapter_2/bert_model')

print(outputs.last_hidden_state.shape)

Two files are saved in the given directory, *config.json* and *model.safetensors*, which together define the saved model:
- *config.json* - stores the attributes needed to build the model architecture.
- *model.safetensors* - the state dictionary; it contains all the model's weights.


Furthermore, we can load the saved model by calling the `from_pretrained()` method of `AutoModel` with the path of that directory:

In [None]:
model = AutoModel.from_pretrained('data/chapter_2/bert_model')

# Tokenisation: Encoding Text

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
encoded_input = tokenizer("How are you?", "How's your holiday going?")
print(encoded_input)

Here we can see that when we encode a pair of sentences together, *token_type_ids* is also returned to specify which sentence each token ID belongs to. This is needed because the two sentences are merged into one sequence: if we decode the input IDs back to text, we get a single long string containing the special tokens `[CLS]` and `[SEP]`. These tokens are model-specific and mark the start of the sequence and the end of each sentence.

In [None]:
tokenizer.decode(encoded_input["input_ids"])
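The pair layout described above can be sketched by hand. The helper `build_pair` and the token IDs are made up for illustration (101 and 102 stand in for `[CLS]` and `[SEP]`); the sketch only mirrors the BERT-style layout and is not the tokeniser's actual implementation.

```python
def build_pair(ids_a, ids_b, cls_id, sep_id):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment IDs."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Made-up token IDs for a three-token sentence A and a two-token sentence B.
pair = build_pair([11, 12, 13], [21, 22], cls_id=101, sep_id=102)
print(pair["input_ids"])       # [101, 11, 12, 13, 102, 21, 22, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Both lists have the same length, so every position in `input_ids` has a matching segment marker.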

Just as we can load a known model directly from its class, we can load a tokeniser directly from its class without going through `AutoTokenizer`. Likewise, we can call `save_pretrained()` to save the tokeniser.

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

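# Named `encoded_sentence` rather than `input` to avoid shadowing Python's built-in input().
encoded_sentence = tokenizer("I stay till late in the library because it is open till late and I am not so productive when staying back in my room.")

tokenizer.save_pretrained("data/chapter_2/tokeniser")

print(encoded_sentence)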

# Tokeniser Pipeline
!["Tokeniser Workflow"](data/chapter_2/tokeniser_workflow.png  "The Tokeniser Workflow")