# Utilizing Hugging Face Pipelines for Natural Language Processing (NLP)

## Project Overview  

As a **data science student**, I’m exploring **Hugging Face** and its capabilities, sharing insights on how to leverage these powerful tools for NLP. This project covers:  

### Key Topics  
- **Pipelines** – Abstracting complex NLP tasks into a few lines of code  
- **Tokenizers & Models** – Understanding how text is processed  
- **PyTorch** – Running models with your preferred framework  
- **Saving & Loading Models** – Managing models efficiently  

### What is Hugging Face? 🤗  

Hugging Face is a leading platform in **Natural Language Processing (NLP)** and **Machine Learning (ML)**, providing state-of-the-art transformer models, tools, and an open-source ecosystem for working with AI. It simplifies complex deep learning tasks and allows users to apply powerful models with minimal effort.  

See the [pipelines documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for a comprehensive list of the pipelines you can call from Hugging Face.

Hugging Face also offers a full course: [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt)

The complete list of models is available in the [Model Hub](https://huggingface.co/models).

### Why is Hugging Face Valuable?  

- **Ease of Use** – Apply advanced NLP models with just a few lines of code.  
- **Pre-trained Models** – Access thousands of ready-to-use models from the **Model Hub**, reducing training time and cost.  
- **Interoperability** – Works seamlessly with **PyTorch**, **TensorFlow**, and **JAX**.  
- **Fine-tuning & Customization** – Train models on your own data for domain-specific applications.  
- **Open-Source & Community-Driven** – Supported by a vast community contributing to continuous improvements.  

### What Are Transformers?  

**Transformers** are a type of deep learning model that has revolutionized **Natural Language Processing (NLP)**. Introduced in the paper *"Attention Is All You Need"* by **Vaswani et al. (2017)**, transformers use a mechanism called **self-attention** to process each word in relation to all other words in a sentence, rather than sequentially like traditional models.  
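
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the toy vectors are made up, and the queries, keys, and values are the input vectors themselves, whereas a real transformer first applies learned projection matrices and uses several attention heads.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a weighted average of ALL input vectors, so every
    position can draw on the whole sentence in a single step.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this position to every position (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional "embeddings" for a three-word sentence.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(toy_embeddings)
for row in attended:
    print([round(x, 3) for x in row])
```

Because no output depends on a previous output, the loop over positions could run in parallel, which is the property the next section refers to.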

### Why Are Transformers Powerful?  

- **Parallelization** – Unlike older models (e.g., RNNs), transformers process words **simultaneously**, making them highly efficient.  
- **Context Awareness** – They capture the **meaning of words** based on their **entire context**, rather than just nearby words.  
- **Scalability** – Transformers power massive models like **GPT-4, BERT, and T5**, capable of human-like text understanding and generation.  
- **Versatility** – Used for **text generation, translation, question answering, summarization, and more!**  

---

#### Installing Hugging Face Transformers

In [None]:
# Installing hugging face transformers
!pip install transformers

#### 1) Abstracting away NLP tasks with Hugging Face's 'Pipelines'
This demonstrates the power of Hugging Face's pipeline abstraction, which simplifies working with complex NLP models. By simply specifying "sentiment-analysis", we load a pre-trained model that handles text classification without needing to fine-tune or preprocess data manually.

Hugging Face provides various pipelines for different NLP tasks, such as:
- "text-generation" for AI-powered writing
- "translation" for multilingual applications
- "ner" (Named Entity Recognition) for extracting key information
- "question-answering" for answering queries from context

This abstraction makes NLP accessible, allowing developers to quickly experiment with state-of-the-art models without deep ML expertise.

---

In [87]:
# Importing the pipeline function from Hugging Face's transformers library
from transformers import pipeline

# Initializing a sentiment-analysis pipeline using a pre-trained model.
# With no model specified, the pipeline falls back to a default checkpoint; as the log below shows, this is DistilBERT fine-tuned on SST-2.
# That model is a binary sentiment classifier (POSITIVE/NEGATIVE).
classifier = pipeline("sentiment-analysis")

# Running the classifier on a sample text (a joke here) to analyze its sentiment.
# The input text will be processed, and the model will classify its sentiment as either positive or negative.
result = classifier("Why did the astronaut break up with the alien? Because the relationship needed more space!")

# Printing the result, which contains the model's sentiment prediction (label and confidence score).
# The output will include the sentiment label (e.g., "POSITIVE" or "NEGATIVE") and the model's confidence score, indicating how sure the model is about its prediction.
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9987579584121704}]
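
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. A small sketch of consuming that structure follows. The `summarize` helper and the 0.9 threshold are illustrative assumptions, not part of the Transformers API, and the sample value is copied from the output above.

```python
def summarize(predictions, threshold=0.9):
    """Format pipeline-style predictions and flag low-confidence ones.

    `predictions` is a list of {"label": str, "score": float} dictionaries,
    the shape printed by the sentiment-analysis pipeline.
    """
    lines = []
    for pred in predictions:
        # Hypothetical rule: anything below the threshold gets a warning tag.
        flag = "" if pred["score"] >= threshold else " (low confidence)"
        lines.append(f"{pred['label']}: {pred['score']:.2%}{flag}")
    return lines


# Value copied from the notebook output above.
result = [{"label": "NEGATIVE", "score": 0.9987579584121704}]
print(summarize(result))
```

A high score only means the model is confident, not that it is right: here a joke is labelled NEGATIVE with very high confidence.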


#### 2) Using the Text-Generation Pipeline and Selecting a Pre-Trained Model
By selecting a pre-trained model, such as DistilGPT-2, a distilled (smaller and faster) version of GPT-2, we can generate contextually relevant and coherent text based on an input prompt.

The text-generation pipeline allows you to:
- Initialize the generator with a pre-trained model.
- Provide an input prompt that serves as the basis for generating text.
- Control the length of the generated text (max_length).
- Generate multiple variations of the generated text (num_return_sequences).

---

In [43]:
# Initializing a text-generation pipeline using a pre-trained model.
# The 'distilgpt2' model is a smaller, distilled version of GPT-2, designed for text generation.
# It generates coherent and contextually relevant text based on the input prompt.
generator = pipeline("text-generation", model="distilgpt2")

# Running the text generator on the provided prompt.
# The model will generate new text based on the starting sentence "The President of the United States is writing a bill about".
# Parameters:
# - max_length=30: The maximum length of the generated text (including the input prompt).
# - num_return_sequences=2: The number of different text completions to generate.
result = generator(
    "The President of the United States is writing a bill about",
    max_length=30,
    num_return_sequences=2,
)

# Printing the result, which contains the generated text based on the input prompt.
# The result will include two generated sequences since num_return_sequences=2.
# Each sequence will have a maximum length of 30 tokens (including the prompt).
print(result)


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The President of the United States is writing a bill about the future of our nation's most powerful financial system. And I'm in favor of what we"}, {'generated_text': 'The President of the United States is writing a bill about how we can strengthen our law and protect our democracy. The bill proposes to change federal laws for'}]
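
Each returned dictionary's `generated_text` contains the prompt followed by the continuation. Below is a minimal sketch of separating the two; `strip_prompt` is a hypothetical helper, and the sample strings are copied from the output above.

```python
def strip_prompt(prompt, generations):
    """Return only the newly generated text for each sequence."""
    continuations = []
    for item in generations:
        text = item["generated_text"]
        # The output above starts with the prompt, so remove it if present.
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


prompt = "The President of the United States is writing a bill about"

# Values copied from the notebook output above.
result = [
    {"generated_text": prompt + " the future of our nation's most powerful "
                                "financial system. And I'm in favor of what we"},
    {"generated_text": prompt + " how we can strengthen our law and protect our "
                                "democracy. The bill proposes to change federal laws for"},
]

continuations = strip_prompt(prompt, result)
for text in continuations:
    print(text)
```

The text-generation pipeline also accepts `return_full_text=False`, which should achieve the same effect without post-processing.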


#### 3) Expanding on the Previous Idea with Zero-Shot Classification
Zero-shot classification is a machine learning technique where a model is able to classify data into categories it has never seen during training. Instead of training on a specific set of labels, the model is provided with a set of candidate labels and can classify new, unseen data based on its understanding of the relationships between the input text and those labels.

This is especially useful when you don’t have labeled data for every possible category or when new categories arise after training, because the model can make predictions without additional training on those specific labels. In Hugging Face, the pipeline relies on a pre-trained model fine-tuned for natural language inference (like BART or RoBERTa; the default used below is `facebook/bart-large-mnli`) to perform this task.

In the output below, the highest-scoring label is 'government' (about 0.53), followed by 'law' (0.28) and 'politics' (0.19). All three are plausible for this prompt, and the scores show how the model ranks them.

---

In [52]:
# Importing the pipeline function from Hugging Face's transformers library
from transformers import pipeline

# Initializing a zero-shot classification pipeline.
# The 'zero-shot-classification' task allows the model to classify text into categories, even if it has not been explicitly trained on those categories.
# This is useful when you have a text but don't have a model specifically fine-tuned for your categories.
classifier = pipeline("zero-shot-classification")

# Running the classifier on a sample text to classify its topic.
# The input text ("The President of the United States is writing a bill about") is passed to the model for classification.
# Candidate labels are the potential categories the text could belong to: "government", "politics", and "law".
result = classifier(
    "The President of the United States is writing a bill about",
    candidate_labels=["government", "politics", "law"],
)

# Printing the result, which contains the classification output.
# The result will indicate which candidate label the model predicts is the most appropriate for the input text,
# along with a confidence score for the classification.
print(result)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'The President of the United States is writing a bill about', 'labels': ['government', 'law', 'politics'], 'scores': [0.5303876996040344, 0.28375473618507385, 0.18585756421089172]}
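
The result is a single dictionary in which `labels` and `scores` are parallel lists. A short sketch of pairing them up and picking the winner, using the values copied from the output above:

```python
# Values copied from the notebook output above.
result = {
    "sequence": "The President of the United States is writing a bill about",
    "labels": ["government", "law", "politics"],
    "scores": [0.5303876996040344, 0.28375473618507385, 0.18585756421089172],
}

# Pair each label with its score.
ranked = list(zip(result["labels"], result["scores"]))

# Pick the best label explicitly rather than relying on list order.
top_label, top_score = max(ranked, key=lambda pair: pair[1])

for label, score in ranked:
    print(f"{label:<12}{score:.3f}")
print("Top label:", top_label)
print("Sum of scores:", round(sum(result["scores"]), 6))
```

The scores sum to 1 here because, with the pipeline's default `multi_label=False`, the candidate labels compete with one another.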


#### 4) Using Tokenizers with Hugging Face Sentiment Analysis

A tokenizer converts text into a numerical representation that the model can process: it splits the text into tokens (words or subword units) and maps each token to an integer ID. This code demonstrates how to use Hugging Face's tokenizers and models for sentiment analysis and tokenization. It shows how to load a pre-trained model and tokenizer, tokenize a sequence, convert tokens to IDs, and decode them back to text.
 
---

In [70]:
# Importing the required libraries for Tokenizer and model loading.
# AutoTokenizer and AutoModelForSequenceClassification are used for loading pre-trained models and tokenizers from Hugging Face.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Initializing a sentiment-analysis pipeline using a pre-trained model (default model for sentiment-analysis).
# The pipeline function handles the preprocessing, model inference, and post-processing for sentiment analysis.
classifier = pipeline("sentiment-analysis")

# Running the classifier on a sample joke to analyze its sentiment.
# The model will classify the sentiment as either POSITIVE or NEGATIVE.
result = classifier("Why did the astronaut break up with the alien? Because the relationship needed more space!")
print(result)

# Specifying the model name for tokenization and classification.
# The model 'distilbert-base-uncased-finetuned-sst-2-english' is a pre-trained DistilBERT model fine-tuned on the SST-2 dataset for sentiment analysis.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Loading the pre-trained model for sequence classification (sentiment analysis).
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Loading the tokenizer associated with the model.
# The tokenizer is used to convert the input text into tokens that the model can process.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calling the tokenizer directly returns the model-ready encoding: input IDs (with special tokens added) and an attention mask.
sequence = "Transformers are really cool, and not so difficult"
result = tokenizer(sequence)
print(result)

# Tokenizing the sequence into individual tokens (words/subword units) that the model understands.
tokens = tokenizer.tokenize(sequence)
print(tokens)

# Converting tokens to their corresponding numerical IDs, as the model works with token IDs, not raw text.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Decoding the token IDs back to human-readable text.
# This demonstrates how the tokenized input is represented and can be decoded back into text.
decoded_string = tokenizer.decode(ids)
print(decoded_string)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9987579584121704}]
{'input_ids': [101, 19081, 2024, 2428, 4658, 1010, 1998, 2025, 2061, 3697, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['transformers', 'are', 'really', 'cool', ',', 'and', 'not', 'so', 'difficult']
[19081, 2024, 2428, 4658, 1010, 1998, 2025, 2061, 3697]
transformers are really cool, and not so difficult
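
Notice that `tokenizer(sequence)` produced 11 IDs while `convert_tokens_to_ids` produced only 9. The two extra IDs, 101 and 102, are the `[CLS]` and `[SEP]` special tokens that BERT-style tokenizers wrap around a sequence. The sketch below reproduces the text → tokens → IDs → text round trip with a toy vocabulary built from the output above. The functions are hypothetical stand-ins: the splitter is a simple regular expression, not the WordPiece algorithm the real tokenizer uses.

```python
import re

# Token-to-ID pairs copied from the output above; 101 and 102 are the
# [CLS] and [SEP] special tokens.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "transformers": 19081, "are": 2024, "really": 2428, "cool": 4658,
    ",": 1010, "and": 1998, "not": 2025, "so": 2061, "difficult": 3697,
}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
SPECIAL_IDS = {VOCAB["[CLS]"], VOCAB["[SEP]"]}


def tokenize(text):
    """Lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text):
    """Map text to IDs and wrap the sequence in [CLS] ... [SEP]."""
    ids = [VOCAB[token] for token in tokenize(text)]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def decode(ids, skip_special_tokens=True):
    """Map IDs back to a space-joined string (no punctuation clean-up)."""
    tokens = [
        ID_TO_TOKEN[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)


sequence = "Transformers are really cool, and not so difficult"
encoded = encode(sequence)
print(tokenize(sequence))
print(encoded)
print(decode(encoded))
```

Unlike this toy version, the real `decode` also tidies spacing around punctuation, which is why the notebook output shows `cool,` without a space before the comma.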


#### 5) Using PyTorch for Natural Language Processing
This code performs sentiment analysis using Hugging Face's Transformers and PyTorch. It first uses a pre-trained DistilBERT model in a sentiment analysis pipeline to classify a joke's sentiment. Then, it manually tokenizes the input text, converts it into tensors, and runs it through the model without computing gradients. The raw model outputs (logits) are processed with softmax to get probability scores, and the final sentiment label is determined using argmax(). This approach provides both a high-level and low-level understanding of sentiment classification.
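
The manual tokenization step in the cell below passes `padding=True`, which only matters once a batch holds sequences of different lengths. The following sketch imitates that step in plain Python. `pad_batch` is a hypothetical helper, and the padding ID of 0 is an assumption that matches BERT-style vocabularies where `[PAD]` has ID 0. The token IDs reuse values printed in section 4.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 19081, 2024, 2428, 4658, 1010, 1998, 2025, 2061, 3697, 102],
    [101, 19081, 2024, 4658, 102],  # a shorter sequence: "transformers are cool"
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)
    print(mask)
```

With a single input sentence, as in the cell below, nothing needs padding and the attention mask is all ones.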

---

In [77]:
# Import necessary libraries from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch  # PyTorch library for tensor operations
import torch.nn.functional as F  # Functional module for softmax and other operations

# Define the pre-trained sentiment analysis model (DistilBERT fine-tuned on SST-2 dataset)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the pre-trained model for sequence classification (sentiment analysis)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the tokenizer associated with the model to process text input
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the sentiment analysis pipeline using the model and tokenizer
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Define a sample input sentence for sentiment analysis
X_train = ["Why did the astronaut break up with the alien? Because the relationship needed more space!"]
# Note: Even though this is a single sentence, it's placed in a list for batch processing compatibility.

# Perform sentiment analysis using the pipeline
result = classifier(X_train)
print(result)  # Output contains sentiment label and confidence score

# Tokenize the input text into a format suitable for the model
batch = tokenizer(
    X_train,           # Input text list
    padding=True,      # Pads the input to the same length for batch processing
    truncation=True,   # Truncates input if it exceeds max length (512 tokens)
    max_length=512,    # Maximum token limit for DistilBERT
    return_tensors="pt"  # Return as PyTorch tensors for model compatibility
)
print(batch)  # Display tokenized batch

# Perform inference without calculating gradients (for efficiency)
with torch.no_grad():
    outputs = model(**batch)  # Forward pass through the model
    print(outputs)  # Print raw output logits

    # Apply softmax to convert logits into probability scores for each class (POSITIVE or NEGATIVE)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)  # Print class probabilities

    # Get the predicted class by taking the index of the highest probability
    labels = torch.argmax(predictions, dim=1)
    print(labels)  # Output is either 0 (NEGATIVE) or 1 (POSITIVE)

Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9987579584121704}]
{'input_ids': tensor([[  101,  2339,  2106,  1996, 19748,  3338,  2039,  2007,  1996,  7344,
          1029,  2138,  1996,  3276,  2734,  2062,  2686,   999,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 3.6659, -3.0238]]), hidden_states=None, attentions=None)
tensor([[0.9988, 0.0012]])
tensor([0])
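
To see what `F.softmax` and `torch.argmax` are doing, the same arithmetic can be sketched in plain Python using the logits printed above. The `id2label` mapping is written out by hand here as an assumption that follows the 0 (NEGATIVE) / 1 (POSITIVE) convention noted in the code; with a loaded model it can be read from `model.config.id2label`.

```python
import math

logits = [3.6659, -3.0238]  # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order for this model


def softmax(values):
    """Exponentiate and normalize so the results sum to 1."""
    highest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# argmax: the index of the largest probability.
predicted = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 4) for p in probs])
print(id2label[predicted])
```

The rounded probabilities match the tensor printed by PyTorch, and the larger one is the score the pipeline reported.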


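#### 6) Saving and Loading a Model
Calling `save_pretrained` on the tokenizer and the model writes their files to a directory, and `from_pretrained` reloads them from that directory. This lets you reuse a model without downloading it again, which makes deployment and reuse more efficient.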

---

In [None]:
# Define the directory where the model and tokenizer will be saved.
save_directory = "saved"

# Save the tokenizer to the specified directory.
tokenizer.save_pretrained(save_directory)

# Save the trained model to the specified directory.
model.save_pretrained(save_directory)

# Load the tokenizer from the saved directory.
tok = AutoTokenizer.from_pretrained(save_directory)

# Load the model from the saved directory.
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)
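
One way to gain confidence in the round trip is to check that the reloaded model reproduces the original model's logits. The helper below is a sketch under stated assumptions: `reloaded_logits_match` is a hypothetical name, the tolerance is arbitrary, and it expects that `save_pretrained` has already written both the model and the tokenizer to `save_directory`.

```python
def reloaded_logits_match(model, tokenizer, save_directory, text, atol=1e-5):
    """Return True if a model reloaded from disk reproduces the original logits."""
    # Imported inside the function so that defining this helper does not
    # itself require torch or transformers to be installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    reloaded_tokenizer = AutoTokenizer.from_pretrained(save_directory)
    reloaded_model = AutoModelForSequenceClassification.from_pretrained(save_directory)

    # Put both models in evaluation mode so dropout cannot change the output.
    model.eval()
    reloaded_model.eval()

    with torch.no_grad():
        original = model(**tokenizer(text, return_tensors="pt")).logits
        restored = reloaded_model(**reloaded_tokenizer(text, return_tensors="pt")).logits

    return torch.allclose(original, restored, atol=atol)
```

In the notebook it could be called as `reloaded_logits_match(model, tokenizer, save_directory, "Transformers are really cool")` after the cell above has run.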