In [1]:
import torch
print(torch.__version__)

2.9.1+cpu


### Transformers
The Transformers library downloads pre-trained models for natural language understanding (NLU) tasks, such as analyzing the sentiment of a text, and for natural language generation (NLG) tasks, such as completing a prompt with new text or translating it into another language.

First we will see how to use the pipeline API to quickly run those pre-trained models at inference. Then we will dig a little deeper and see how the library gives access to those models and helps preprocess the data.

##### USE_CASE
1. Sentiment Analysis : Decide whether a text is positive or negative
2. Text Generation : Provide a prompt and the model will generate what follows
3. Named Entity Recognition (NER) : In an input sentence, label each word with the entity it represents (person, place)
4. Question Answering : Provide the model with some context and a question, and extract the answer from the context
5. Filling masked text : Given a text with masked words, fill in the blanks
6. Summarization : Generate a summary of a long text
7. Translation : Translates a text into another language
8. Feature Extraction : Return a tensor representation of the text
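
Each use case above corresponds to a task string passed to `pipeline(...)`. The mapping below is a reference sketch, not an exhaustive list: the names are the ones used in the Transformers 4.x releases (an assumption about the installed version, since later releases may rename or drop tasks), and `translation_en_to_fr` is just one language pair chosen for illustration.

```python
# Use case (as named in the list above) -> task string for transformers.pipeline().
# Assumption: task names as in Transformers 4.x; check your installed version.
PIPELINE_TASKS = {
    "Sentiment Analysis": "sentiment-analysis",
    "Text Generation": "text-generation",
    "Named Entity Recognition": "ner",
    "Question Answering": "question-answering",
    "Filling masked text": "fill-mask",
    "Summarization": "summarization",
    "Translation": "translation_en_to_fr",  # one example language pair
    "Feature Extraction": "feature-extraction",
}

for use_case, task in PIPELINE_TASKS.items():
    print(f"{use_case:<26} -> pipeline('{task}')")
```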

In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
# Note: Transformers requires PyTorch to be installed; otherwise it will throw an error ("NameError: torch is not present")
# Note: Transformers does not support Keras 3 yet, so tf-keras must be installed in the environment; otherwise it will throw a NameError

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.





Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cpu


In [3]:
classifier('We are very happy to show you the transformer library')

[{'label': 'POSITIVE', 'score': 0.9998239874839783}]

When the above command is run, a pre-trained model and its tokenizer are downloaded and cached. The tokenizer's job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable.
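
To make the post-processing step concrete, here is a minimal sketch of roughly what happens after the model has produced its raw scores (logits): a softmax turns them into probabilities, and the highest one is reported with its label. The function name `postprocess`, the logit values, and the `id2label` mapping are assumptions made for illustration, not the pipeline's actual internals.

```python
import math

def postprocess(logits, id2label):
    """Turn raw model scores (logits) into a readable {'label', 'score'} dict."""
    # Softmax: shift by the max for numerical stability, exponentiate, normalize.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Report the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence; assumed order: 0 = NEGATIVE, 1 = POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
print(postprocess([-4.2, 4.4], id2label))
```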

By default the model downloaded for this pipeline is "distilbert-base-uncased-finetuned-sst-2-english". It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

In [5]:
results = classifier(
    [
        "We hope you like the food",
        "We are very happy to show you the transformer library",
        "Taste is not very good but is manageable",
        "I don't like to work hard",
        "I prefer to work smart"
    ]
)
for result in results:
    print(f"label:{result['label']}, with score of {round(result['score'],4)}")

label:POSITIVE, with score of 0.9998
label:POSITIVE, with score of 0.9998
label:POSITIVE, with score of 0.9994
label:NEGATIVE, with score of 0.9883
label:POSITIVE, with score of 0.9993


In [6]:
classifier("esperamos que no lo odie")

[{'label': 'POSITIVE', 'score': 0.9682278633117676}]

### Model used above: [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)
Note: These models are available on Hugging Face.

Now let's say that we want to use another model, for example one that is trained on German data. We can search through the models on Hugging Face, which gathers many pre-trained models published by research labs.
For other languages, one model that can be used is [nlptown/bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).

For that we need 2 classes 
1. AutoTokenizer
2. AutoModelForSequenceClassification or TFAutoModelForSequenceClassification

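AutoTokenizer --> Takes the text data and converts it into numerical data (integer token IDs) that the model can consume. Unlike word2vec, these are indices into a vocabulary, not embedding vectors.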

In [9]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

##### Fine_Tune
If we don't find a model that has been pre-trained on data similar to ours, we need to fine-tune a pre-trained model on our own data.

Under the hood of a pre-trained model
- First, the tokenizer is responsible for preprocessing the text. It splits the given text into words (or parts of words, punctuation and symbols), usually called tokens
- Multiple rules govern this process, which is why we initialize the tokenizer using the name of the model: this ensures we apply the same rules that were used when the model was trained
- The next step is to convert these tokens into numbers, so that we can build tensors out of them and feed them to the model
- To do this, the tokenizer has a 'vocab', which is the part we download when we instantiate it with the 'from_pretrained' method, since we have to use the same vocab as when the model was pre-trained
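
The steps above can be sketched with a toy tokenizer. DistilBERT uses WordPiece, which splits each word greedily into the longest pieces found in the vocab. Everything below is a simplification for illustration: `VOCAB` and its IDs are made up (the real vocab has roughly 30k entries), and real tokenizers also handle punctuation and normalization.

```python
# Toy vocabulary with made-up IDs (hypothetical, for illustration only).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "we": 4, "are": 5, "happy": 6, "transform": 7, "##er": 8, "library": 9}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("We are happy transformer library", VOCAB)
print(tokens)
print(ids)
```

Because the word "transformer" is not in the toy vocab, it is split into the pieces `transform` and `##er`, which is why the same rules and the same vocab must be used at training and inference time.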

In [10]:
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Device set to use 0


In [11]:
classifier("I am a good developer")

[{'label': 'POSITIVE', 'score': 0.9998760223388672}]
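
Since the warning above says the TensorFlow classes are deprecated, the same classifier can be built from the PyTorch classes instead. This is a sketch under the assumption that PyTorch is installed (as in cell 1); the helper name `build_pt_classifier` is hypothetical. Calling it downloads the model, so the call is left as a comment.

```python
def build_pt_classifier(model_name):
    """Build a sentiment-analysis pipeline backed by the PyTorch model class."""
    # Imported here so that defining the helper does not load any model.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Usage (downloads the weights on first call):
# classifier = build_pt_classifier("distilbert-base-uncased-finetuned-sst-2-english")
# print(classifier("I am a good developer"))
```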

In [12]:
inputs = tokenizer("We are very happy to show you the transformer library")
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 10938, 2121, 3075, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


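The call above returns a dictionary that maps strings to lists of ints: `input_ids` contains the IDs of the tokens, and `attention_mask` marks which positions the model should attend to.

Next, we configure padding, truncation and a maximum length on the tokenizer call, and build a batch of tensors for the model.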

In [13]:
tf_batch = tokenizer(
    ["We are very happy to show you the transformer library"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='tf'
)

for key, value in tf_batch.items():
    print(f"{key} : {value.numpy().tolist()}")

input_ids : [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 10938, 2121, 3075, 102]]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
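
With a single sentence in the batch, padding has no visible effect. The sketch below shows what padding and truncation do when sentences differ in length. It is a simplified stand-in, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the ID lists are invented, `pad_id=0` is assumed to be the padding token, and real truncation keeps the closing special token rather than slicing it off.

```python
def pad_batch(batch_ids, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 102],  # hypothetical 6-token sentence
    [101, 2057, 3407, 102],              # hypothetical 4-token sentence
])
for key, value in batch.items():
    print(f"{key} : {value}")
```

The shorter sequence is extended with `pad_id` so that all rows have the same length and can be stacked into one tensor, while its attention mask gets zeros in the padded positions.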


### Another model for multilingual text: [nlptown/bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)
Note: These models are available on Hugging Face.