This section outlines a workflow for text classification: start with a text dataset, build a classification model, and create a shareable demo, using open-source tools from the Hugging Face ecosystem.
-
Hugging Face Overview
- Hugging Face is a leading AI company specializing in Natural Language Processing (NLP) and Machine Learning (ML).
- It provides pretrained models, datasets, and tools for deep learning applications.
- Hugging Face has become the go-to platform for open-source AI research and deployment.
-
Key Components
- Transformers Library – Provides state-of-the-art NLP models like BERT, GPT, T5, DistilBERT for tasks such as text classification, summarization, and translation.
- Datasets Library – Access and process large-scale ML datasets efficiently.
- Tokenizers Library – Fast, optimized tokenization for various transformer models.
- Accelerate Library – Simplifies distributed training and multi-GPU setups.
- Evaluate Library – Provides evaluation metrics for ML models.
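To make the Transformers library concrete, here is a minimal sketch of running text classification through the pipeline API; the model used is whatever default the library downloads, not the food/not-food model built later in the project.

```python
from transformers import pipeline

# Create a text classification pipeline with a default pretrained model
# (weights are downloaded on first run).
classifier = pipeline(task="text-classification")

# Run inference on a couple of example captions.
results = classifier(["A delicious plate of scrambled eggs.",
                      "A dog running on the beach."])
print(results)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}, ...]
```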
-
Hugging Face Hub
- Model Hub – Hosts thousands of pretrained ML models.
- Dataset Hub – Collection of ready-to-use datasets.
- Spaces – Platform for hosting and sharing AI-powered demos with Gradio and Streamlit.
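As a taste of what Spaces can host, here is a minimal Gradio sketch that wraps a text classification function in a shareable web interface (the classifier used here is just an illustrative placeholder).

```python
import gradio as gr
from transformers import pipeline

# Placeholder model: any text classification pipeline could be swapped in.
classifier = pipeline(task="text-classification")

def classify(text: str) -> dict:
    # Return {label: score} so Gradio's Label output can display it.
    return {pred["label"]: float(pred["score"]) for pred in classifier(text)}

demo = gr.Interface(fn=classify, inputs="text", outputs="label",
                    title="Text Classification Demo")
demo.launch()  # share=True gives a temporary public link
```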
-
Why Use Hugging Face?
- Pretrained models allow for transfer learning to improve performance with minimal data.
- Easy-to-use APIs integrate with PyTorch, TensorFlow, and JAX.
- Supports open-source collaboration and community-driven AI advancements.
• Building a text classification model to identify whether text (like image captions) is about food or not food
• This is similar to technology used in the Nutrify app, which helps users learn about food
• The project follows three main steps:
  - Data: Problem definition and dataset preparation
  - Model: Finding, training and evaluating a text classification model using Hugging Face
  - Demo: Creating and sharing a demo for others to use
• The finished project will result in a trained model with a shareable demo hosted on Hugging Face
• Text classification is the process of assigning categories to text (words, phrases, sentences, paragraphs, or documents)
• Common text classification problems include spam detection, sentiment analysis, language detection, topic classification, hate speech detection, and product categorization
• Classification types include binary (one thing or another), multi-class (one from many), and multi-label (one or more from many)
• Text classification is widely used in business settings, such as categorizing insurance claims
• Models for text classification include:
- Rule-based (simple but requires manual rule creation)
- Bag of Words (simple but doesn't capture word order)
- TF-IDF (weighs word importance but doesn't capture word order)
- Deep learning (can learn complex patterns but requires more data/compute)
• Deep learning models often perform better when paired with quality datasets, and Hugging Face makes them straightforward to implement (a simple baseline is sketched below for comparison)
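For contrast with the deep learning approach used in the rest of the project, here is a minimal sketch of a bag-of-words/TF-IDF baseline, assuming scikit-learn is installed and using a few made-up captions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up example data (a real project needs far more samples).
texts = ["a plate of scrambled eggs", "a bowl of ramen with pork",
         "a man riding a bicycle", "a laptop on a wooden desk"]
labels = ["food", "food", "not_food", "not_food"]

# TF-IDF features feeding a simple linear classifier; note that word order is lost.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

print(baseline.predict(["a bowl of scrambled eggs"]))  # likely ['food']
```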
An example text classification problem: classifying insurance claim texts as at fault or not at fault. The model's output could then route the claim to the appropriate department in the insurance company.
Why train your own text classification model? Here are the most important points:
• You can use pre-trained models, API-powered models, or large language models (GPT, Gemini, Claude, Mistral) for text classification
• Training/fine-tuning your own model offers several advantages:
- Full control over model lifecycle
- No usage limits (except compute constraints)
- Train once and deploy anywhere
- Better privacy with in-house data handling
- Often faster performance with small, specialized models
• Using pre-built model APIs has different benefits:
- Easy setup with minimal code
- No need to maintain compute resources
- Access to advanced models
- Ability to scale with usage
• API models come with drawbacks:
- Dependency on third-party service uptime
- Data must be sent externally
- May have daily/periodic usage limits
- Often slower due to API call requirements
For this project, we're going to focus on fine-tuning our own model.
-
Workflow Overview: The process follows the motto "data, model, demo!" for structured machine learning development.
-
Data Preparation:
- Create and preprocess the dataset for training.
-
Model Definition:
- Utilize transformers.AutoModelForSequenceClassification to define a text classification model.
-
Hyperparameter Configuration:
- Define training arguments using transformers.TrainingArguments (controls optimization, scheduling, etc.).
-
Training Setup:
- Initialize a transformers.Trainer instance with the training arguments and dataset.
-
Model Training:
- Execute training with Trainer.train().
-
Model Saving:
- Save the trained model locally or push it to the Hugging Face Hub.
-
Evaluation & Testing:
- Generate predictions on test data and analyze model performance.
-
Deployment & Demo:
- Convert the trained model into a shareable demo.
-
Non-Linearity of ML Projects:
- The workflow provides structure but allows flexibility for iterative experimentation.
But this workflow will give us some good guidelines to follow.
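Putting the workflow above into code, here is a hedged sketch of what the main steps can look like; the dataset name comes from later sections, while the column names, split handling, and hyperparameter values are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# 1. Data: load the captions dataset and tokenize the text column
#    (column/split names assumed; inspect the dataset to confirm).
dataset = load_dataset("mrdbourke/learn_hf_food_not_food_image_captions")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

dataset = dataset.map(tokenize, batched=True)
splits = dataset["train"].train_test_split(test_size=0.2)

# 2. Model: pretrained backbone with a new 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2)

# 3. Hyperparameters (example values only).
training_args = TrainingArguments(output_dir="food_not_food_model",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=32)

# 4./5. Trainer setup (dynamic padding per batch) and training.
trainer = Trainer(model=model, args=training_args,
                  train_dataset=splits["train"],
                  eval_dataset=splits["test"],
                  data_collator=DataCollatorWithPadding(tokenizer=tokenizer))
trainer.train()

# 6. Save locally (or trainer.push_to_hub() to share on the Hugging Face Hub).
trainer.save_model("food_not_food_model")
```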
-
Library Imports & Setup
- Ensure correct setup by following the setup guide for local environments.
- Google Colab users have most libraries pre-installed but need additional installations.
- Enable GPU in Google Colab via Runtime ➡️ Change runtime type ➡️ Hardware accelerator ➡️ GPU.
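Once the runtime has restarted, a quick check (assuming PyTorch, which ships with the Colab runtime) confirms the GPU is visible:

```python
import torch

# Should print True and the GPU name (e.g. a T4 on the free Colab tier).
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```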
-
Required Libraries from Hugging Face Ecosystem
- transformers – Pre-installed on Colab; install with pip install transformers.
- datasets – Handles dataset access and manipulation; install with pip install datasets.
- evaluate – Provides performance evaluation metrics; install with pip install evaluate.
- accelerate – Optimizes ML model training; install with pip install accelerate.
- gradio – Builds interactive ML model demos; install with pip install gradio.
-
Checking Installed Versions
- Use package_name.__version__ to verify installed package versions.
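For example, a short sketch of checking the versions of the libraries listed above (the pip line is only needed if an import fails):

```python
# !pip install transformers datasets evaluate accelerate gradio  # uncomment in Colab if needed
import transformers, datasets, evaluate, accelerate, gradio

# Print the installed version of each library in the project's stack.
for package in (transformers, datasets, evaluate, accelerate, gradio):
    print(f"{package.__name__}: {package.__version__}")
```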
-
Dataset Importance in Machine Learning
- The dataset choice directly influences the model type and output quality.
- High-quality datasets lead to better model performance, while poor datasets degrade model quality.
-
Text Classification Dataset Structure
- Typically consists of text samples (e.g., sentences, paragraphs) and corresponding labels.
- The example dataset contains synthetic image captions labeled as "food" or "not food".
-
Dataset Source
- Available on Hugging Face: mrdbourke/learn_hf_food_not_food_image_captions.
- Designed for practicing text classification tasks.
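A minimal sketch of loading the dataset with the datasets library; the split and column names should be inspected rather than assumed.

```python
from datasets import load_dataset

# Download the food/not-food image caption dataset from the Hugging Face Hub.
dataset = load_dataset("mrdbourke/learn_hf_food_not_food_image_captions")

print(dataset)              # shows the available splits and column names
print(dataset["train"][0])  # inspect a single example (assumes a "train" split)
```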
-
Food Not Food Image Caption Dataset Creation
- Dataset creation process documented in this Google Colab notebook.
-
Synthetic Data Generation
- A Large Language Model (LLM) was used to generate food and non-food image captions.
- This technique is useful for bootstrapping dataset creation when real data is limited.
- Recommended workflow: prioritize real data and supplement with synthetic data when necessary.
-
Model Evaluation Best Practices
- Always evaluate and test models on real-world data rather than relying solely on synthetic data.
-
Sources for Text-Based Datasets
- Hugging Face Hub – A vast collection of datasets for various NLP tasks.
- Hugging Face Text Classification Datasets – Specific datasets for text classification.
- Kaggle Datasets – A popular platform for diverse machine learning datasets.
-
Synthetic Dataset Creation with LLMs
- Large Language Models (LLMs) can generate synthetic data for text classification problems.
- Enables custom dataset creation when real-world data is limited.
-
Steps for Model Training Setup
- Preprocess Data – Prepare and clean the dataset.
- Define Model – Use transformers.AutoModelForSequenceClassification.
- Set Training Arguments – Configure hyperparameters with transformers.TrainingArguments.
- Initialize Trainer – Pass the TrainingArguments and dataset to transformers.Trainer.
- Train the Model – Call Trainer.train().
- Save the Model – Store locally or on the Hugging Face Hub.
- Evaluate Performance – Make predictions and analyze test data results.
- Deploy Model – Create a shareable demo.
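For the evaluation step, one common pattern is to wrap a metric from the evaluate library in a compute_metrics function that the Trainer calls on the test predictions; a sketch assuming accuracy as the metric:

```python
import evaluate
import numpy as np

# Load the accuracy metric from the evaluate library.
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is (logits, labels) as provided by the Trainer.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Pass it to the Trainer, e.g. Trainer(..., compute_metrics=compute_metrics)
```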
-
Using Pretrained Models for Transfer Learning
- Load models with from_pretrained().
- Pretrained Model: distilbert/distilbert-base-uncased, trained on BookCorpus and English Wikipedia.
- Transfer Learning Benefits:
  - Achieves good results with limited data.
  - Can be adapted across various domains (e.g., vision, audio, NLP).
- Key Question: "Does a pretrained model exist for my task, and can I fine-tune it?"
-
Model Customization & Configuration
- Use AutoModelForSequenceClassification for text classification.
- Customize model architecture with pretrained_model_name_or_path and num_labels.
- Adjust the classification head with transformers.PretrainedConfig to set id2label and label2id.
-
Further Learning
- Example of Transfer Learning in PyTorch: PyTorch Transfer Learning Guide.
The content is based on Daniel's comprehensive deep learning course Text Classification with Hugging Face Transformers
and reflects his expertise in making complex deep learning concepts accessible through practical, hands-on examples.
Visit Daniel's GitHub profile for more resources on machine learning and deep learning.