In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("sentiment-analysis")

In [None]:
classifier([" i like you","i hate you"])

The pipeline has multiple components:
tokenization (text to token IDs via a vocabulary), model inference, and postprocessing that converts the logits into probabilities and then labels.

The tokenizer itself does two things:
--> Splitting the text into tokens
--> Mapping each token to an integer

We use the AutoTokenizer class and its from_pretrained() method to load and apply
the tokenizer on its own. Otherwise it is already part of the pipeline.
Each model has its own specific tokenizer.
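
The two tokenizer steps (splitting into tokens, then mapping each token to an integer) can be sketched without any library. This is a toy illustration only, not how the DistilBERT tokenizer works: the whitespace split, the `toy_vocab` dictionary and the `toy_encode` helper are assumptions made up for the example.

```python
# Toy illustration of the two tokenizer steps; real tokenizers use subword rules.
toy_vocab = {"i": 0, "like": 1, "hate": 2, "you": 3}  # hypothetical vocabulary

def toy_encode(text):
    tokens = text.lower().split()              # step 1: split the text into tokens
    ids = [toy_vocab[tok] for tok in tokens]   # step 2: map each token to an integer
    return tokens, ids

tokens, ids = toy_encode(" i like you")
print(tokens)  # ['i', 'like', 'you']
print(ids)     # [0, 1, 3]
```

A real tokenizer ships its own vocabulary with the checkpoint, which is why each model needs its matching tokenizer.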


In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

now we can pass text directly to tokenizer and see what happens to it

In [None]:
tokenizer("i like you")

This gets fed into the model, where embedding vectors are created from these IDs.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument; below we request PyTorch tensors with return_tensors="pt".
The main things to remember here are that you can pass one sentence or a list of sentences, as well as specifying the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation= True, return_tensors="pt")
inputs

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an AutoModel class which also has a from_pretrained() method:

In [None]:
from transformers import AutoModel

In [None]:
model_cp = AutoModel.from_pretrained(checkpoint)

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features.
Note that whether we get hidden states or logits depends on the class used to load the checkpoint (AutoModel here), not on the word "base" in the checkpoint name, which refers to the model size.
For each model input, we retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model; that vector is the feature vector, or hidden state.
These features are usually fed into a head for a downstream task, or they can be used in an unsupervised way as well.
Different tasks can be performed with the same architecture, but each of these tasks will have a different head associated with it.


The hidden-state tensor output by the model usually has three dimensions:
1. Batch size: the number of sequences processed at a time.
2. Sequence length: the length of the numerical representation of each sequence.
3. Hidden size: the vector dimension of each model input, such as 768.

For example, [10, 128, 768].
It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).
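
To make the three axes concrete, here is a small sketch that builds a dummy "hidden state" out of nested Python lists and reads off its shape. The sizes are illustrative values picked for the example, not the output of any real model.

```python
# Build a dummy "hidden state" as nested lists to make the three axes concrete.
batch_size, sequence_length, hidden_size = 2, 16, 768  # illustrative values

hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

# One vector of length hidden_size per token, one list of tokens per sequence.
shape = (len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))
print(shape)  # (2, 16, 768)
```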

In [None]:
outputs =  model_cp(**inputs)
outputs

In [None]:
outputs.last_hidden_state.shape

Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).

In [None]:
outputs[0].shape

In [None]:
outputs["last_hidden_state"].shape

Model heads: Making sense out of numbers
The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

raw text --> tokenizer --> input IDs --> embedding layer --> Transformer blocks (which include the attention layers)
--> hidden states (features) --> linear head projecting to the target dimension
--> logits --> softmax --> class

This is what a typical model looks like.
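
The "linear head" step is just a matrix multiplication plus a bias. Below is a minimal sketch in plain Python; the 4-dimensional feature vector, the weights and the bias are made-up numbers (a real head would use the model's hidden size and learned weights), and `linear_head` is a hypothetical helper name.

```python
# Toy linear head: logits = W @ h + b, written with plain lists.
def linear_head(hidden_vector, weights, bias):
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]

hidden_vector = [0.5, -1.0, 2.0, 0.0]    # hypothetical 4-dimensional feature vector
weights = [[1.0, 0.0, -1.0, 0.5],         # one row per output class (2 classes)
           [-1.0, 0.5, 1.0, 0.0]]
bias = [0.1, -0.1]

logits = linear_head(hidden_vector, weights, bias)
print(logits)  # two logits, approximately [-1.4, 0.9]
```

The head maps a vector of size 4 to a vector of size 2, one score per class; this is the "projection onto a different dimension" described above.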


In [None]:
from transformers import AutoModel


In [None]:
from transformers import AutoTokenizer

In [None]:
raw_inputs

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
tokenizer(raw_inputs)

In [None]:
tokenizer(raw_inputs, truncation= True, padding=True)

In [None]:
input = tokenizer(raw_inputs, truncation= True, padding=True, return_tensors="pt")

In [None]:
from transformers import AutoModel

In [None]:
model_cp = AutoModel.from_pretrained(checkpoint)

In [None]:
output = model_cp(**input)

In [None]:
output.last_hidden_state.shape

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
classification_full_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
output= classification_full_model(**input)

In [None]:
output

In [None]:
print(output.logits.shape)

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

In [None]:
print(output.logits)

The values we get as output from our model don’t necessarily make sense by themselves.
Our model predicted [-1.5607, 1.6123] for the first sentence and [ 4.1692, -3.3464] for the second one. 
Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model (the linear layer that projects the hidden states, i.e. the head).

To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [None]:
import torch

In [None]:
predictions = torch.nn.functional.softmax(output.logits, dim=-1)

In [None]:
predictions

Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
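
The softmax arithmetic can be double-checked without PyTorch. This is a minimal sketch in plain Python that uses the logits quoted above; the `softmax` helper is written here for illustration and is not the library function.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs_first = softmax([-1.5607, 1.6123])
probs_second = softmax([4.1692, -3.3464])
print([round(p, 4) for p in probs_first])   # [0.0402, 0.9598]
print([round(p, 4) for p in probs_second])  # [0.9995, 0.0005]
```

Each row sums to 1, and the larger logit always gets the larger probability.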

In [None]:
classification_full_model.config.id2label

According to config.id2label, index 0 holds the probability for class 'NEGATIVE' and index 1 the probability for class 'POSITIVE'.
So we can conclude that the model predicted the following:

First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
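
Turning the probabilities into a final label is an argmax followed by a lookup. A small sketch, assuming the `id2label` mapping and the rounded probabilities quoted above (both are hard-coded here rather than read from the model):

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}   # mapping as reported by config.id2label
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

results = []
for probs in probabilities:
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

This label/score format mirrors the shape of what the sentiment pipeline returns for each input.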

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let’s take some time to dive deeper into each of those steps.

In [None]:
from transformers import pipeline

In [None]:
input_raw_text = [" i am so happy",
                  "i am so so so so sad"]

In [None]:
sentiment_classification_pipeline = pipeline("sentiment-analysis")

In [None]:
pipeline_output = sentiment_classification_pipeline(input_raw_text)


In [None]:
pipeline_output

The above pipeline has done tokenization, embeddings, Transformer attention blocks,
hidden-state features, the classification head, and the conversion of logits to probabilities
and then to actual labels, all in one go. But we can do it step by step as well.


In [None]:
from transformers import AutoModel
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

In [None]:
# each tokenizer is tied to a model checkpoint, so we define the checkpoint name first
model_cp = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
# "base" in the name refers to the model size; whether we get hidden states or
# classification logits depends on the class we load the checkpoint with
# (AutoModel vs. AutoModelForSequenceClassification)
tokenizer = AutoTokenizer.from_pretrained(model_cp)

In [None]:
input_token_ids = tokenizer(input_raw_text, truncation = True, 
                              padding = True,
                              return_tensors = "pt")

In [None]:
input_token_ids

In [None]:
# now we have got input token ids, we can apply base model to them

sent_model = AutoModel.from_pretrained(model_cp)

In [None]:
output_sent = sent_model(**input_token_ids)
# output_sent is a model output object; its last_hidden_state attribute holds the hidden states

In [None]:
output_sent.last_hidden_state.shape

In [None]:
# now we need to pass it through classification head
sent_model_full = AutoModelForSequenceClassification.from_pretrained(model_cp)

In [None]:
output_sent_full = sent_model_full(**input_token_ids)

In [None]:
output_sent_full.logits

In [None]:
# these are not hidden states but logits: the hidden states have already been passed through the classification head.

In [None]:
import torch
output_sent_prob = torch.nn.functional.softmax(output_sent_full.logits, dim =-1)

In [None]:
output_sent_prob

In [None]:
sent_model_full.config.id2label

The AutoModel class and all of its relatives are actually simple wrappers over the wide variety of models available in the library.
AutoModel is handy when you want to instantiate any model from a checkpoint.

It’s a clever wrapper, as it can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture,
as shown below.

In [None]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
at_model = AutoModel.from_pretrained(checkpoint)

In [None]:
at_model

In [None]:
from transformers import BertModel, BertConfig

In [None]:
config = BertConfig()

In [None]:
config

In [None]:
model = BertModel(config)  # build the model from the config (randomly initialized)

In [None]:
model

In [None]:
from transformers import BertConfig, BertModel
config = BertConfig()

In [None]:
print(config)

In [None]:
model_initial = BertModel(config)

In [None]:
model_initial
# Model is randomly initialized!

In [None]:
from transformers import BertModel
model_pretrained = BertModel.from_pretrained("bert-base-cased")
#this model is now with pretrained weights

In [None]:
model_pretrained

The weights have been downloaded and cached (so future calls to the from_pretrained() method won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/hub in recent versions of the library (older versions used ~/.cache/huggingface/transformers).
You can customize your cache folder by setting the HF_HOME environment variable.

In [None]:
model_pretrained.save_pretrained("model_download")
# this saves the model to the "model_download" directory as two files:
# ls model_download
#
# config.json
# pytorch_model.bin (or model.safetensors in recent versions of the library)

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]
# if I pass these through a tokenizer, I will get input IDs like the ones below.
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [None]:
# it will be a list of lists, list of list means rectangular shape 
# so this can be converted into tensors.

In [None]:
import torch

In [None]:
model_inputs = torch.tensor(encoded_sequences)  # encoded_sequences must be a rectangular list of lists

In [None]:
model_inputs

In [None]:
# now that the token IDs are in a tensor, we can pass them through the
# embedding layer, where each token ID is converted into an embedding vector,
# so an additional dimension is added.

In [None]:
# now we can pass these tensors through our model. Let's pass them through the pretrained model first
pretrained_outputs = model_pretrained(model_inputs)

In [None]:
pretrained_outputs.last_hidden_state.shape
# as you can see, this gives only the hidden states, not the final logits.

In [None]:
from transformers import BertConfig, BertModel
config = BertConfig()
untrained_model = BertModel(config)

In [None]:
untrained_outputs = untrained_model(model_inputs)

In [None]:
untrained_outputs.last_hidden_state.shape

In [None]:
untrained_outputs

In [None]:
pretrained_outputs

In [None]:
# so we have loaded an untrained model and a pretrained model and passed our input through both.
# we haven't added any classifier head to create logits,
# but we can use these output hidden vectors as features for unsupervised learning at this stage

While the model accepts a lot of different arguments, only the input IDs are necessary. We’ll explain what the other arguments do and when they are required later, but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand.



In [None]:
# Tokenizers serve one purpose: to convert raw text into a format
# compatible with the model.

In [None]:
# The goal of a tokenizer is to find the most meaningful representation
# (that is, the one that makes the most sense to the model)
# and, if possible, the smallest representation.

In [None]:
"im am okay".split()

In [None]:
# we need a custom token to represent words that are not in 
# our vocabulary. This is known as the “unknown” token, 
# often represented as ”[UNK]” or ”<unk>”
#It’s generally a bad sign if you see that the tokenizer is 
# producing a lot of these tokens, as it wasn’t able to retrieve a 
# sensible representation of a word and you’re losing information along the way. 
#The goal when crafting the vocabulary is to do it in such a way 
# that the tokenizer tokenizes as few words as possible into the unknown token.


# One way to reduce the amount of unknown tokens 
# is to go one level deeper, using a character-based tokenizer.
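
The unknown-token behaviour can be sketched with a word-level vocabulary. Everything here is a toy assumption (the four-entry `word_vocab`, the `word_encode` helper and the choice of ID 0 for "[UNK]"); it only shows how an out-of-vocabulary word loses its identity.

```python
UNK = "[UNK]"
word_vocab = {UNK: 0, "i": 1, "am": 2, "okay": 3}  # hypothetical tiny vocabulary

def word_encode(text):
    # Any word missing from the vocabulary collapses to the unknown token.
    return [word_vocab.get(word, word_vocab[UNK]) for word in text.split()]

print(word_encode("i am okay"))   # [1, 2, 3]
print(word_encode("im am okay"))  # [0, 2, 3]  -> "im" is out of vocabulary
```

Once "im" has become ID 0 there is no way to recover which word it was, which is the information loss described above.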



In [None]:
# Character-based tokenizers split the text into characters, 
# rather than words. This has two primary benefits:

# The vocabulary is much smaller.
# There are much fewer out-of-vocabulary (unknown) tokens, 
# since every word can be built from characters.
# But here too some questions arise concerning spaces and punctuation:

In [None]:
# This approach isn’t perfect either. Since the representation is 
# now based on characters rather than words, one could argue that, 
# intuitively, it’s less meaningful: each character doesn’t mean a 
# lot on its own, whereas that is the case with words. However, this 
# again differs according to the language; in Chinese, for example, 
# each character carries more 
# information than a character in a Latin language.

# Another thing to consider is that we’ll end up with a 
# very large amount of tokens to be processed by our model:
# whereas a word would only be a single token with a word-based tokenizer, it can easily turn 
# into 10 or more tokens when converted into characters.
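
The trade-off between vocabulary size and sequence length is easy to see on a single word. A rough sketch in plain Python, using "annoyingly" as an arbitrary example word:

```python
text = "annoyingly"

word_tokens = text.split()   # word-based: the whole word is one token
char_tokens = list(text)     # character-based: one token per character

print(len(word_tokens), len(char_tokens))  # 1 10
print(sorted(set(char_tokens)))            # the distinct characters needed in the vocabulary
```

The character vocabulary needed for this word is tiny, but the model now has ten tokens to process instead of one.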

In [None]:
# To get the best of both worlds, we can use a third technique that 
# combines the two approaches: subword tokenization.

In [None]:
# Subword tokenization algorithms rely on the principle that 
# frequently used words should not be split into smaller subwords, 
# but rare words should be decomposed into meaningful subwords.

# For instance, “annoyingly” might be considered a rare word and 
# could be decomposed into “annoying” and “ly”. These are both likely 
# to appear more frequently as standalone subwords, while at the same time the 
# meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
#These subwords end up providing a lot of semantic meaning:
# this allows us to have relatively good coverage with small 
# vocabularies, and close to no unknown tokens.

# This approach is especially useful in agglutinative languages such 
# as Turkish, where you can form (almost) 
# arbitrarily long complex words by stringing together subwords.
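
To illustrate the idea, here is a toy greedy longest-match-first splitter in the style of WordPiece, where continuation pieces carry a "##" prefix. It is a simplified sketch, not the library's implementation: the four-entry `toy_vocab` and the `subword_tokenize` helper are assumptions for the example, and real vocabularies are learned from data.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"annoying", "##ly", "quick", "##er"}  # hypothetical subword vocabulary
print(subword_tokenize("annoyingly", toy_vocab))  # ['annoying', '##ly']
print(subword_tokenize("quick", toy_vocab))       # ['quick']
```

A frequent word like "quick" stays whole, while the rarer "annoyingly" is decomposed into pieces that are both in the vocabulary.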

In [None]:
# Unsurprisingly, there are many more techniques out there. To name a few:

# Byte-level BPE, as used in GPT-2
# WordPiece, as used in BERT
# SentencePiece or Unigram, as used in several multilingual models

In [None]:
from transformers import BertTokenizer
b_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [None]:
b_tokenizer

In [None]:
from transformers import AutoTokenizer
a_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
# BertTokenizer only works with BERT checkpoints.
# AutoTokenizer picks the right tokenizer class for any checkpoint, so it is more general.
# For this checkpoint both should produce the same encodings, so we can use either.

In [None]:
a_tokenizer("Using a Transformer network is simple")

In [None]:
b_tokenizer("Using a Transformer network is simple")

In [None]:
a_tokenizer.save_pretrained("directory")

In [None]:
# Translating text to numbers is known as encoding.
# Encoding is done in a two-step process: the tokenization,
# followed by the conversion to input IDs.

# So encoding consists of two steps:
# creating tokens from the input sequence, then mapping them to input IDs.


# A tokenizer is specific to the model it is used with.

In [None]:
# the second step converts the tokens into token IDs (numbers), which are then
# converted into tensors

In [None]:
token_ids =[[1,2,3],
            [4,5,7],
            [899,20]]


In [None]:
import torch
model_input = torch.tensor(token_ids)  # raises ValueError: the rows have different lengths

In [None]:
token_ids =[[1,2,3],
            [4,5,7],
            [899,20, 20]]

In [None]:
import torch
model_input = torch.tensor(token_ids)

In [None]:
model_input

In [None]:
# As you can see, every row must have the same number of items:
# the model needs rectangular input, which is why shorter sequences are padded.
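
Padding itself is simple to sketch. The helper below is a hypothetical illustration (the name `pad_batch` and the default `pad_id=0` are assumptions; in practice the tokenizer does this for you with padding=True and uses its own padding token ID):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

ragged = [[1, 2, 3], [4, 5, 7], [899, 20]]
padded = pad_batch(ragged)
print(padded)  # [[1, 2, 3], [4, 5, 7], [899, 20, 0]]
```

After padding, every row has the same length, so the list of lists is rectangular and can be turned into a tensor.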

In [None]:
# To get a better understanding of the two steps (token creation and token ID assignment), we’ll explore them
# separately. Note that we will use some methods that perform parts
# of the tokenization pipeline separately to show you the intermediate
# results of those steps, but in practice, you should call the
# tokenizer directly on your inputs (as shown in section 2).

In [None]:
from transformers import AutoTokenizer
tokenizer= AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
sequence = "Using a Transformer network is simple"

In [None]:
# this is both steps together
tokenizer(sequence)

In [None]:
# you can see the direct call gives you the input IDs together with
# token_type_ids and attention_mask

In [None]:
# This is the token generation step. Remember that BERT uses a WordPiece tokenizer,
# which is a subword tokenizer.
# Like every tokenizer it has a vocabulary; it splits words until it finds matches in that vocabulary.
tokens = tokenizer.tokenize(sequence)
print(tokens)

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

In [None]:
token_ids

In [None]:
# We get different token IDs even though we used the same tokenizer.
# The difference happens because:

# tokenizer(sequence) adds special tokens like [CLS] and [SEP]

# Manual tokenization (tokenize → convert_tokens_to_ids) does not add these special tokens
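
The difference between the two ID lists is just the wrapping step. A minimal sketch, assuming [CLS] has ID 101 and [SEP] has ID 102 (the values that wrap the encoded_sequences shown earlier); `add_special_tokens` here is a hypothetical helper, not the tokenizer method:

```python
CLS_ID, SEP_ID = 101, 102  # assumed IDs, matching the encoded_sequences shown earlier

def add_special_tokens(token_ids):
    # Wrap the sequence the way a direct tokenizer call does for a single sentence.
    return [CLS_ID] + token_ids + [SEP_ID]

manual_ids = [7592, 999]               # IDs without special tokens
print(add_special_tokens(manual_ids))  # [101, 7592, 999, 102]
```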

In [None]:
# the above process is called encoding. meaning going from text to tokens and token ids.

In [None]:
# going from token ids to tokens or sentences is decoding.

In [None]:
tokenizer.decode(token_ids)

In [None]:
token_ids

In [None]:
tokenizer(sequence, add_special_tokens=False)


In [None]:
output = tokenizer(sequence)

In [None]:
print(tokenizer.decode(output["input_ids"]))

In [None]:
output

In [None]:
# Note that the decode method not only converts the indices back to tokens, 
# but also groups together the tokens that were part of the same words to
# produce a readable sentence. This behavior will be extremely useful when 
# we use models that predict new text (either text generated from a prompt, 
# or for sequence-to-sequence problems like translation or summarization).
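
The grouping that decode performs can be approximated in a few lines. This is a simplified sketch, not the library's decode: the token list is a hypothetical WordPiece-style split and `merge_wordpieces` is a made-up helper that only handles the "##" continuation prefix.

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)

tokens = ["Using", "a", "Trans", "##former", "network"]  # hypothetical subword tokens
print(merge_wordpieces(tokens))  # Using a Transformer network
```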

In [None]:
# By now you should understand the atomic operations a tokenizer can handle: 
# tokenization, conversion to IDs, and converting IDs back to a string. However, 
# we’ve just scraped the tip of the iceberg. In the following section, we’ll take our approach to its limits 
# and take a look at how to overcome them.