# Hugging Face

Hugging Face offers everything from tokenizers, which help computers make sense of text, to a huge variety of ready-to-go language models, and even a treasure trove of data suited for language tasks.

Among other things, Hugging Face provides:
1. Tokenizers
2. Models
3. Datasets
4. Trainers

**Tokenizers:** These work like a translator, splitting the words we use into smaller pieces and mapping each piece to a numeric ID that computers can understand and work with.

**Models:** These are like the brain for computers, allowing them to learn and make decisions based on information they've been fed.

**Datasets:** Think of datasets as textbooks for computer models. They are collections of information that models study to learn and improve.

**Trainers:** Trainers are the coaches for computer models. They help models get better at their tasks through repeated practice and feedback. Hugging Face Trainers implement the PyTorch training loop for you, so you can focus on other aspects of working on the model.
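To give a feel for what "the training loop" means, here is a toy, framework-free sketch of the cycle a trainer automates: make a prediction, measure the error, and nudge the weight. It uses neither PyTorch nor the Hugging Face `Trainer`; the data, the `train` function, and the learning rate are all made up for illustration.

```python
# Toy data following y = 3x, so the ideal weight is 3.0 (made-up values)
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

def train(data, epochs=200, lr=0.01):
    """Fit y = w * x by gradient descent on the squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x                # forward pass
            grad = 2 * (pred - y) * x   # gradient of (pred - y)**2 with respect to w
            w -= lr * grad              # update step
    return w

weight = train(data)
print(f"learned weight: {weight:.3f}")
```

A real training loop adds batching, an optimizer, evaluation, and checkpointing on top of this cycle, which is the bookkeeping a trainer takes off your hands.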



## Tokenizers

Hugging Face tokenizers help us break down text into smaller, manageable pieces called tokens. They are easy to use, and the "fast" variants (for example `BertTokenizerFast`, or what `AutoTokenizer` returns by default) are also remarkably fast because they are backed by a library written in Rust.

**Tokenization:** The process by which an input series of characters is transformed into the units the model is prepared to predict upon. It's like cutting a sentence into individual pieces, such as words, parts of words, or characters, to make it easier to analyze. A model trained on data tokenized by one tokenizer must use that same tokenizer for prediction; this is similar to reusing the same feature engineering steps in traditional machine learning.

**Tokens:** The fundamental unit of input to language models. These are the pieces you get after cutting up text during tokenization, kind of like individual Lego blocks that can be words, parts of words, or even single letters. Each token is converted to a numerical ID so the model can work with it.

**Pre-trained Model:** This is a ready-made model that has already been trained on a large amount of data.

**Uncased:** This means that the model treats uppercase and lowercase letters as the same; the text is lowercased before it is tokenized.
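As a rough illustration of what "uncased" does to text before tokenization, the sketch below lowercases a string and strips accent marks. The `uncase` helper is hypothetical and written only for this example; it approximates the normalization step and is not the library's own code.

```python
import unicodedata

def uncase(text):
    """Lowercase text and drop accent marks, roughly what an uncased tokenizer does first."""
    # NFD splits an accented letter into the base letter plus a combining mark
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" covers the combining marks, which we drop
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncase("Hello"), uncase("HELLO"), uncase("Café"))
```

After this step, "Hello" and "HELLO" are indistinguishable to the model.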

In [1]:
from transformers import BertTokenizer

In [2]:
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [3]:
# See how many tokens are in the vocabulary
tokenizer.vocab_size

30522

In [4]:
# Tokenize the sentence
sent_0 = "I heart Generative AI"
tokens = tokenizer.tokenize(sent_0)

In [5]:
# Print the tokens
print(tokens)

['i', 'heart', 'genera', '##tive', 'ai']
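The `##` prefix marks a piece that continues the previous token rather than starting a new word. BERT-style tokenizers use WordPiece, which splits an unknown word greedily into the longest pieces found in the vocabulary. The sketch below mimics that idea with a tiny made-up vocabulary (`toy_vocab`); the real vocabulary has 30,522 entries, and the helper names here are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

# Tiny illustrative vocabulary, not the real BERT vocabulary
toy_vocab = {"i", "heart", "genera", "##tive", "ai"}
toy_tokens = toy_tokenize("I heart Generative AI", toy_vocab)
print(toy_tokens)
```

Because "generative" is not in the toy vocabulary as a whole word, it is split into `genera` and `##tive`, just as in the output above.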


In [6]:
# Show the token ids assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))

[1045, 2540, 11416, 6024, 9932]


In [7]:
# Pair each token with its id
dict(zip(tokens, tokenizer.convert_tokens_to_ids(tokens)))

{'i': 1045, 'heart': 2540, 'genera': 11416, '##tive': 6024, 'ai': 9932}
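The mapping also works in reverse: ids can be turned back into tokens, and `##` pieces can be glued onto the token before them to recover readable text. Here is a small sketch using only the five ids shown above. The `merge_wordpieces` helper is hypothetical; on a real tokenizer you would normally call its `decode` method instead.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, attaching '##' continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()

# The token-to-id pairs printed above
token_ids = {'i': 1045, 'heart': 2540, 'genera': 11416, '##tive': 6024, 'ai': 9932}
id_to_token = {idx: tok for tok, idx in token_ids.items()}

ids = [1045, 2540, 11416, 6024, 9932]
decoded_tokens = [id_to_token[i] for i in ids]
text = merge_wordpieces(decoded_tokens)
print(text)
```

The recovered text is all lowercase because the uncased tokenizer discarded the original capitalization.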

## Models

Hugging Face models provide a quick way to get started using models trained by the community. With only a few lines of code, you can load a pre-trained model and start using it on tasks such as sentiment analysis.

In [8]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

In [9]:
# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)

In [10]:
# Tokenize the input sequence
sent = "I love mathematics"
inputs = tokenizer.tokenize(text=sent)
dict(zip(inputs, tokenizer.convert_tokens_to_ids(inputs)))

{'i': 1045, 'love': 2293, 'mathematics': 5597}
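Note that `tokenize` returns only the word pieces. When a tokenizer is called directly on text, it also wraps the sequence in the special `[CLS]` and `[SEP]` tokens and returns an attention mask alongside the ids. The sketch below imitates that packaging in plain Python. The ids 101 and 102 are assumed to be the `[CLS]` and `[SEP]` ids of the BERT uncased vocabulary, and `build_inputs` is a hypothetical helper, not a library function.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP] in the BERT uncased vocabulary

def build_inputs(token_ids):
    """Wrap token ids in special tokens and add the companion fields a BERT model expects."""
    input_ids = [CLS_ID] + token_ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence, so every position is segment 0
        "attention_mask": [1] * len(input_ids),  # 1 means "real token, attend to it"
    }

encoded = build_inputs([1045, 2293, 5597])  # ids for 'i', 'love', 'mathematics'
print(encoded)
```

The real tokenizer returns these fields as tensors when `return_tensors="pt"` is passed, which is the form the model consumes.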

In [11]:
# Make a prediction and report the sentiment with its probability
def use_model(input_text):
    # Calling the tokenizer directly returns the tensors the model expects
    inputs = tokenizer(
        text=input_text,
        return_tensors="pt")
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(**inputs).logits
    # Turn the raw scores into probabilities over the two classes
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    # Index of the most likely class for the single input sequence
    arg_ind = torch.argmax(probabilities, dim=1).item()
    if arg_ind == 1:
        print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
    else:
        print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
    label = model.config.id2label[arg_ind]
    print(f"\tModel label: {label}")
    print(f"\tModel arg_index: {arg_ind}")

In [12]:
use_model(sent)

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model arg_index: 1
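The percentage comes from the softmax step, which turns the model's two raw scores (logits) into probabilities that sum to 1; the predicted class is simply the index of the larger one. Here is a plain-Python sketch of that step. The logit values are made up for illustration and are not the model's actual outputs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.8, 0.75]  # hypothetical [negative, positive] scores
probs = softmax(logits)
# argmax: the index of the highest probability
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 4) for p in probs])
```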


In [13]:
# Alternatively, let the Auto classes pick the matching tokenizer and model classes
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [14]:
use_model(sent)

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model arg_index: 1


In [15]:
# Alternatively, use a pipeline, which bundles tokenization, inference, and post-processing
from transformers import pipeline
pipe = pipeline("text-classification", model="textattack/bert-base-uncased-imdb")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Device set to use mps:0
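The `tokenizers` warning is informational and does not affect the result. As the message itself suggests, it can be avoided by setting the `TOKENIZERS_PARALLELISM` environment variable before the tokenizer is first used. A minimal sketch, assuming you are happy to turn tokenizer parallelism off for this session:

```python
import os

# Set this before any tokenizer is used so the forked process does not trigger the warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

The last output line just reports the device the pipeline chose; `mps` is PyTorch's backend for Apple GPUs, so you will see a different device on other hardware.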


In [16]:
# Run the pipeline; it returns one dict per input with the top label and its score
out = pipe(sent)
print(out[0])
print()
label = out[0]['label']
score = out[0]['score']
# This checkpoint uses generic label names; LABEL_1 is treated as positive here
if label == "LABEL_1":
    print(f"Sentiment: Positive ({score * 100:.2f}%)")
else:
    print(f"Sentiment: Negative ({score * 100:.2f}%)")
print(f"\tModel label: {label}")
print(f"\tModel score: {score}")

{'label': 'LABEL_1', 'score': 0.8251070380210876}

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model score: 0.8251070380210876
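Since the pipeline returns plain dictionaries, the label lookup can be kept in one small table instead of an `if`/`else`. The sketch below runs on the result printed above without loading any model. `LABEL_NAMES` and `describe` are hypothetical names, and the mapping assumes, as the rest of this section does, that `LABEL_1` means positive.

```python
# Assumed mapping from the checkpoint's generic labels to readable names
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}

def describe(result):
    """Format one pipeline result dict as a readable sentiment line."""
    # Fall back to the raw label if it is not in the table
    name = LABEL_NAMES.get(result["label"], result["label"])
    return f"Sentiment: {name} ({result['score'] * 100:.2f}%)"

summary = describe({'label': 'LABEL_1', 'score': 0.8251070380210876})
print(summary)
```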
