In [12]:
from transformers import pipeline

Pipelines: https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [2]:
# Sentiment classification is the automated process of identifying and classifying emotions in text as positive sentiment, 
# negative sentiment, or neutral sentiment based on the opinions expressed within. 
# It helps determine the nature and extent of feelings conveyed using Natural Language Processing (NLP) 
# to understand what customers say or feel about your brand, products, and services.
classifier = pipeline("sentiment-analysis")
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


The sentiment label was 'POSITIVE' with a score of 96%.

The way "pipeline" works is:
- It first preprocesses the text by applying a tokenizer to it.
- Then it feeds the tokenized input to the model, which produces raw predictions.
- Then it postprocesses the model output into the expected result for the type of task, here a label and a score.
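
The three stages can be sketched with toy stand-ins. Everything in this block is hypothetical and only for illustration: the word lists and the functions `toy_tokenize`, `toy_model`, `toy_postprocess`, and `toy_pipeline` are not part of transformers, and the real pipeline uses a trained tokenizer and a neural network rather than word counts.

```python
import math

# Hypothetical word lists standing in for what a trained model has learned.
POSITIVE_WORDS = {"great", "love", "good"}
NEGATIVE_WORDS = {"bad", "hate", "awful"}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def toy_tokenize(text):
    """Step 1 (preprocess): lowercase the text and split it into word tokens."""
    return text.lower().replace(".", " ").replace("!", " ").split()


def toy_model(tokens):
    """Step 2 (model): return one raw score (logit) per class."""
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    return [float(negative), float(positive)]


def toy_postprocess(logits):
    """Step 3 (postprocess): softmax the logits, then report the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring the order described above."""
    return toy_postprocess(toy_model(toy_tokenize(text)))


result = toy_pipeline("Python is great!")
print(result)
```

The point of the sketch is the ordering: raw text never reaches the model directly, and the raw model scores never reach the user directly.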

In [3]:
generator = pipeline("text-generation", model="distilgpt2")

res = generator(
    "In this course, we will teach you how to",
    max_length=30,  # Maximum total length in tokens, prompt included. Other arguments can be found in the documentation.
    num_return_sequences=2
)

print(res)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create a great website and create a great website. Then we will create a website, create a website'}, {'generated_text': 'In this course, we will teach you how to manage your time and resources in a more personalized way.\n\n\n\nThe course will take one'}]


In [11]:
for r in res:
    print(r['generated_text'])

In this course, we will teach you how to create a great website and create a great website. Then we will create a website, create a website
In this course, we will teach you how to manage your time and resources in a more personalized way.



The course will take one


Zero-Shot Classification:
https://huggingface.co/tasks/zero-shot-classification

In [2]:
# "Zero-shot" means the model classifies text into labels it was never explicitly trained on, so no labeled examples are needed.
classifier = pipeline("zero-shot-classification")
res = classifier(
    "This is a course about Python list comprehension",
    candidate_labels=["education", "politics", "business"]  # The pipeline can classify a sequence into any of the class names provided.
)
print(res)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about Python list comprehension', 'labels': ['education', 'business', 'politics'], 'scores': [0.9622026681900024, 0.026841334998607635, 0.010956003330647945]}


In [33]:
print(f"Given the provided text: \n'{res['sequence']}'\nThe classification scores for each label are:\n")
for label, score in zip(res['labels'], res['scores']):
    print(f"{label}: {score}")

Given the provided text: 
'This is a course about Python list comprehension'
The classification scores for each label are:

education: 0.9622026681900024
business: 0.026841334998607635
politics: 0.010956003330647945
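
For context, the default zero-shot pipeline is built on a natural language inference (NLI) model: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input text entails that hypothesis, and the scores are normalized across the labels. Below is a minimal sketch of that bookkeeping only. The entailment logits are made-up numbers, `build_hypotheses` and `rank_labels` are hypothetical helper names rather than transformers APIs, and the template string is assumed to match the pipeline's default.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence for the NLI model."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score, highest first."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidate_labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(candidate_labels)
print(hypotheses)

# Made-up entailment logits, one per (text, hypothesis) pair.
ranking = rank_labels(candidate_labels, [3.0, -1.5, -0.6])
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

This would explain why the labels in the pipeline output come back ordered by score rather than in the order they were passed in.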


Tokenizer and Model

Import classes

In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertTokenizer, BertModel

In [15]:
# Specify a model name. This is the default model used by the "sentiment-analysis" pipeline.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
# The ".from_pretrained()" method downloads (or loads from the local cache) the pre-trained weights and configuration for the named model.
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Likewise, load the matching pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [19]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(f"{res[0]['label']}: With a sentiment score of {res[0]['score']}")

POSITIVE: With a sentiment score of 0.9598049521446228


The result is the same as in the first example, where no model or tokenizer was specified and the "AutoTokenizer" and "AutoModelForSequenceClassification" classes were not used, because that pipeline defaulted to this same model.

The 'AutoTokenizer' class identifies the pre-trained model and loads the appropriate tokenizer class designed to work with that model architecture (e.g. BERT, DistilBERT, GPT-2). It then automatically downloads and caches the tokenizer's configuration, vocabulary, merges (if applicable), and any other necessary files. This information defines how text is split into tokens and how those tokens map to the model's input IDs, which the tokenizer needs in order to function correctly.

In [20]:
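sequence = "Using a Transformer network is simple."
# Calling the tokenizer directly returns the model inputs: the token IDs, with the special tokens [CLS] (101) and [SEP] (102) added, and the attention mask.
res = tokenizer(sequence)
print(res)
# tokenize() splits the text into subword tokens, without adding special tokens.
tokens = tokenizer.tokenize(sequence)
print(tokens)
# convert_tokens_to_ids() maps each token to its ID in the vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# decode() goes the other way and turns the IDs back into text. This tokenizer is uncased, so the result is lowercased.
decoded_string = tokenizer.decode(ids)
print(decoded_string)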

{'input_ids': [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['using', 'a', 'transform', '##er', 'network', 'is', 'simple', '.']
[2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012]
using a transformer network is simple.
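
The `##er` token in the output comes from WordPiece, the subword scheme used by BERT-style tokenizers: a word missing from the vocabulary is split greedily into the longest pieces that are in it, and pieces that continue a word carry a `##` prefix. The sketch below is a toy version of that greedy longest-match-first matching. The tiny vocabulary and the function name `wordpiece_split` are hypothetical; the real tokenizer loads a vocabulary of roughly 30,000 entries from vocab.txt and also handles casing and punctuation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; "transformer" itself is deliberately absent.
toy_vocab = {"using", "a", "transform", "##er", "network", "is", "simple", "."}

tokens = []
for word in ["using", "a", "transformer", "network"]:
    tokens.extend(wordpiece_split(word, toy_vocab))
print(tokens)
```

Splitting rare words into known pieces keeps the vocabulary small while still letting the tokenizer represent words it has never seen whole.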


The attention mask in the 'res' dictionary is binary and tells the model which positions of the input to attend to. When sending a batch into a transformer, the examples in the batch may have varying lengths, so shorter sequences are padded until every example in the batch has the same length. The attention mask marks real tokens with 1 and padding tokens with 0, so the relevant part of a shorter example is the sentence itself. In the example above, all the tokens are given "1" because none of them are padding.
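
A minimal sketch of how padding and the attention mask fit together for a batch with two lengths. The helper name `pad_batch` is hypothetical, and the padding ID of 0 is assumed to be the `[PAD]` token ID for this tokenizer; in practice the tokenizer does this itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each list of token IDs to the length of the longest one.

    Returns the padded IDs and a matching attention mask
    (1 = real token, 0 = padding).
    """
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012, 102],  # IDs from the output above
    [101, 2003, 3722, 102],  # a shorter sequence built from IDs seen above
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The zeros in the second mask line up exactly with the padding IDs, which is how the model knows to ignore those positions.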

PyTorch

In [22]:
import torch
import torch.nn.functional as F

In [23]:
# Add the pipeline like before.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [24]:
# Training Data.
X_train = ["I've been waiting for a HuggingFace course my whole life.", "Python is great!"]

In [25]:
res = classifier(X_train)
print(res)

[{'label': 'POSITIVE', 'score': 0.9598049521446228}, {'label': 'POSITIVE', 'score': 0.9998615980148315}]


Inferential statistics draws conclusions about populations based on a sample of the data. In the same way, a neural network learns from the training data it is given and then tries to predict outcomes for new data sets. Overfitting is when the model fits the training data too closely, so it gives inaccurate predictions and has poor inference performance on new data. The validation set is data held out from the training set and used during training to measure performance and test for overfitting.

In [None]:
# Apply the tokenizer to the X_train data. padding=True pads every sequence to the longest one in the batch
# (a single sequence needs no padding). truncation=True cuts sequences longer than 'max_length';
# if max_length=None, the model's maximum input length is used instead.
# return_tensors="pt" returns PyTorch tensors.
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

# This is inference only, so no parameters are updated. torch.no_grad() turns off gradient tracking, which saves memory and time.
with torch.no_grad():
    # Unpack the batch dictionary into keyword arguments (input_ids, attention_mask).
    outputs = model(**batch)
    print(outputs)
    # Apply the softmax function to the logits to get the prediction probabilities.
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    # Take the index of the highest probability for each input as its predicted label ID.
    labels = torch.argmax(predictions, dim=1)
    print(labels)
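
To make the last two steps concrete, here is a dependency-free sketch of what softmax and argmax compute for a batch of logits. The logit values are invented for illustration, and the `id2label` mapping (0 = NEGATIVE, 1 = POSITIVE) is assumed to match this model's configuration; the plain-Python `softmax` and `argmax` helpers stand in for `F.softmax` and `torch.argmax`.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


# Invented logits for a batch of two inputs, two classes each.
logits = [[-1.5, 1.6], [-4.2, 4.6]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping

probabilities = [softmax(row) for row in logits]
label_ids = [argmax(row) for row in probabilities]
label_names = [id2label[i] for i in label_ids]
print(probabilities)
print(label_ids, label_names)
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same label IDs as taking the argmax of the logits directly.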