<a href="https://colab.research.google.com/github/MRamya-sri/TRANSFORMERS/blob/main/Working.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Working with AutoTokenizer and TFAutoModelForSequenceClassification from the Hugging Face Transformers Library

---





**AutoTokenizer**

AutoTokenizer is a class provided by the Hugging Face Transformers library that automatically selects and loads the appropriate tokenizer for a given model checkpoint. Tokenizers transform raw text into the inputs a model expects, typically token IDs, attention masks, and, for some models, token type IDs.

The basic usage is as follows:

**1. Loading a Pre-trained Tokenizer**

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")



**2. Tokenization**

In [6]:
inputs = tokenizer("Hello, world!", return_tensors="tf")


In [7]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[ 101, 7592, 1010, 2088,  999,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1]], dtype=int32)>}

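This converts the input text into token IDs and returns them as TensorFlow tensors. In the output above, 101 and 102 are the IDs of the special tokens [CLS] and [SEP], which the tokenizer adds automatically. The `return_tensors` parameter can be set to `"tf"` for TensorFlow, `"pt"` for PyTorch, or `"np"` for NumPy.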

**3. Batch Tokenization**

In [8]:
texts = ["Hello, world!", "Transformers are great!"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")


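This tokenizes a batch of texts. With `padding=True`, shorter sequences are padded to the length of the longest sequence in the batch, and with `truncation=True`, sequences longer than the model's maximum input length are cut off, so the batch can be returned as a single rectangular tensor. Here both sentences happen to produce six tokens, so no padding is added and every attention mask value is 1.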

In [9]:
inputs

{'input_ids': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[  101,  7592,  1010,  2088,   999,   102],
       [  101, 19081,  2024,  2307,   999,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]], dtype=int32)>}
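
Since both sentences above have the same length, the effect of padding is not visible in that output. The following is a minimal pure-Python sketch of what padding and attention masks look like when lengths differ. The helper `pad_batch` is a hypothetical name introduced only for illustration and is not part of the Transformers API; the pad ID 0 is the ID of [PAD] in the `bert-base-uncased` vocabulary.

```python
# Minimal sketch of batch padding; pad_batch is a hypothetical helper,
# not a function from the Transformers library.
PAD_ID = 0  # ID of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 7592, 1010, 2088, 999, 102],  # "Hello, world!" as shown above
    [101, 7592, 102],                   # a shorter sequence: [CLS] hello [SEP]
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second row is extended with three pad IDs, and its attention mask ends in three zeros, which is how the model knows to ignore those positions.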

**4. Decoding Tokens**

In [10]:
decoded_text = tokenizer.decode(inputs["input_ids"][0])
decoded_text


'[CLS] hello, world! [SEP]'

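This converts token IDs back into human-readable text. The result is lowercased because `bert-base-uncased` lowercases its input, and it includes the special tokens [CLS] and [SEP]; pass `skip_special_tokens=True` to `decode` to leave them out.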


**Summary of the AutoTokenizer Workflow**

**Tokenization Process:**

**1. Splitting Text into Tokens:** Tokenizers split the input text into smaller units called tokens (words, subwords, or characters).

**2. Mapping Tokens to IDs:** Each token is mapped to its integer ID in the tokenizer's vocabulary.

**3. Creating Attention Masks:** Attention masks indicate which tokens should be attended to (1) and which should be ignored (0), such as padding tokens.

**4. Handling Special Tokens:** Special tokens like [CLS], [SEP], [PAD], etc., are added as required by the model architecture.
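
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the actual BERT tokenizer: it splits on words and punctuation rather than using WordPiece subwords, and `VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names. The vocabulary is a small excerpt that reuses the IDs visible in the outputs earlier in this notebook.

```python
import re

# Small excerpt of the bert-base-uncased vocabulary (IDs as seen in the outputs above).
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "hello": 7592, ",": 1010, "world": 2088, "!": 999}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}
SPECIAL = {"[PAD]", "[CLS]", "[SEP]"}


def toy_encode(text):
    # Step 1: lowercase and split into words and punctuation (a simplification, not WordPiece).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to its ID, falling back to [UNK] for unknown tokens.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    # Step 4: wrap the sequence in the special tokens BERT expects.
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    # Step 3: nothing is padded here, so every position gets a mask value of 1.
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_decode(ids, skip_special_tokens=False):
    tokens = [ID_TO_TOKEN.get(i, "[UNK]") for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    # Tokens are joined with spaces; the real decoder also tidies spacing around punctuation.
    return " ".join(tokens)


encoded = toy_encode("Hello, world!")
print(encoded["input_ids"])  # [101, 7592, 1010, 2088, 999, 102]
print(toy_decode(encoded["input_ids"], skip_special_tokens=True))
```

For this one sentence the toy encoder reproduces the IDs returned by the real tokenizer, but any word outside the excerpt maps to [UNK], whereas the real tokenizer would break it into known subwords.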




---




**TFAutoModelForSequenceClassification**

TFAutoModelForSequenceClassification is another class from the Hugging Face Transformers library. It loads a pre-trained model as a TensorFlow model with a sequence classification head, for tasks such as sentiment analysis or spam detection.

The basic usage is as follows:

**1. Loading a Pre-trained Model**

In [11]:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


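`from_pretrained` downloads and caches the pre-trained BERT weights and adds a sequence classification head on top. As the warning above indicates, the head (`classifier.weight` and `classifier.bias`) is not part of the `bert-base-uncased` checkpoint and is randomly initialized, so the model needs to be fine-tuned on a labeled task before its predictions are meaningful.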

In [12]:
outputs = model(inputs)
logits = outputs.logits


In [13]:
logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-0.00322657,  0.09603457],
       [ 0.05330714,  0.11412238]], dtype=float32)>

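The tokenized inputs are passed to the model, which returns logits (raw, unnormalized prediction scores). The shape (2, 2) corresponds to two input texts and two classes, the default number of labels. Because the classification head has not been trained, these particular values carry no meaning yet.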
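
Logits are usually converted to class probabilities with a softmax, and the predicted class is the index of the largest value in each row. The sketch below does this in plain Python on the logits printed above, so it runs without TensorFlow; the names `softmax`, `raw_logits`, `probs`, and `predicted` are introduced here for illustration only.

```python
import math


def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above (2 inputs x 2 classes).
raw_logits = [[-0.00322657, 0.09603457],
              [0.05330714, 0.11412238]]

probs = [softmax(row) for row in raw_logits]
# Index of the highest probability in each row.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]

for p, label in zip(probs, predicted):
    print([round(x, 3) for x in p], "-> class", label)
```

Both rows come out close to an even split between the two classes, which is what one would expect from a classification head that has not been trained yet.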

**Summary**

**AutoTokenizer:** Handles text preprocessing by converting raw text into model-compatible inputs.

**TFAutoModelForSequenceClassification:** Loads a pre-trained model with a sequence classification head and processes tokenized inputs to produce logits.

With these two classes, and after fine-tuning the classification head on labeled data, you can implement NLP tasks such as sentiment analysis and other kinds of text classification using pre-trained models from the Hugging Face library.