
PyTorch Deep Learning - Text Classification with Hugging Face

Table of Contents

  1. Overview

  2. What is Hugging Face?

  3. What We’re Going to Build

  4. What is Text Classification?

  5. Why Train Your Own Text Classification Models?

  6. Workflow We’re Going to Follow

  7. Importing Necessary Libraries

  8. Getting a Dataset

  9. Food Not Food Image Caption Dataset Creation

  10. Where Can You Get More Datasets?

  11. Setting Up a Model for Training

  12. Acknowledgments

Overview

This project walks through a complete text classification workflow: start with a text dataset, build a classification model, and create a shareable demo. The workflow uses open-source tools from the Hugging Face ecosystem throughout.

What is Hugging Face?

  • Hugging Face Overview

    • Hugging Face is a leading AI company specializing in Natural Language Processing (NLP) and Machine Learning (ML).
    • It provides pretrained models, datasets, and tools for deep learning applications.
    • Hugging Face has become the go-to platform for open-source AI research and deployment.
  • Key Components

    • Transformers Library – Provides state-of-the-art NLP models like BERT, GPT, T5, DistilBERT for tasks such as text classification, summarization, and translation.
    • Datasets Library – Access and process large-scale ML datasets efficiently.
    • Tokenizers Library – Fast, optimized tokenization for various transformer models.
    • Accelerate Library – Simplifies distributed training and multi-GPU setups.
    • Evaluate Library – Provides evaluation metrics for ML models.
  • Hugging Face Hub

    • Model Hub – Hosts thousands of pretrained ML models.
    • Dataset Hub – Collection of ready-to-use datasets.
    • Spaces – Platform for hosting and sharing AI-powered demos with Gradio and Streamlit.
  • Why Use Hugging Face?

    • Pretrained models allow for transfer learning to improve performance with minimal data.
    • Easy-to-use APIs integrate with PyTorch, TensorFlow, and JAX.
    • Supports open-source collaboration and community-driven AI advancements.
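
As a quick taste of the ecosystem, the transformers pipeline API wraps a pretrained model behind a single call. A minimal sketch (the checkpoint name is just an example sentiment model from the Hub, not the model we'll build):

```python
from transformers import pipeline

# Load a pretrained text classification pipeline.
# The checkpoint below is an example sentiment model from the Hugging Face Hub.
classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("A delicious plate of scrambled eggs and toast."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```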

What we're going to build

• Building a text classification model to identify whether text (like image captions) is about food or not food

• This is similar to technology used in the Nutrify app, which helps users learn about food

• The project follows three main steps:

  1. Data: Problem definition and dataset preparation
  2. Model: Finding, training and evaluating a text classification model using Hugging Face
  3. Demo: Creating and sharing a demo for others to use

• The finished project will result in a trained model with a shareable demo hosted on Hugging Face
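
As a preview of the demo step, a Gradio app that wraps a text classifier is only a few lines. A minimal sketch, assuming a pipeline-compatible model (the checkpoint here is a stand-in for the food/not-food model we'll train):

```python
import gradio as gr
from transformers import pipeline

# Stand-in checkpoint; swap in the trained food/not-food model later.
classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def classify_text(text: str) -> dict:
    # Return {label: score} so Gradio's Label output can display it.
    return {pred["label"]: pred["score"] for pred in classifier(text)}

demo = gr.Interface(fn=classify_text, inputs="text", outputs="label")
demo.launch()  # launch(share=True) creates a temporary public link
```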

What is text classification?

• Text classification is the process of assigning categories to text (words, phrases, sentences, paragraphs, or documents)

• Common text classification problems include spam detection, sentiment analysis, language detection, topic classification, hate speech detection, and product categorization

• Classification types include binary (one thing or another), multi-class (one from many), and multi-label (one or more from many)

• Text classification is widely used in business settings, such as categorizing insurance claims

• Models for text classification include:

  • Rule-based (simple but requires manual rule creation)
  • Bag of Words (simple but doesn't capture word order)
  • TF-IDF (weighs word importance but doesn't capture word order)
  • Deep learning (can learn complex patterns but requires more data/compute)

• Deep learning models often perform better with quality datasets, and Hugging Face facilitates their implementation
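
To make the comparison concrete, here is what a simple non-deep-learning baseline might look like with scikit-learn's TF-IDF and logistic regression (a sketch outside the Hugging Face workflow; the tiny dataset is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy captions invented for illustration only.
texts = [
    "A bowl of ramen with a soft-boiled egg",
    "A golden retriever catching a frisbee",
    "Sushi rolls arranged on a wooden board",
    "A city skyline at sunset",
]
labels = [1, 0, 1, 0]  # 1 = food, 0 = not food

# TF-IDF weighs word importance but ignores word order.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["A plate of spaghetti with tomato sauce"]))  # likely [1]
```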

An example text classification problem: classifying insurance claim text as at fault or not at fault. The model's output could then be used to route each claim to the appropriate department within the insurance company.

Why train your own text classification models?

There are a few routes to a text classification model, each with trade-offs:

• You can use pre-trained models, API-powered models, or large language models (GPT, Gemini, Claude, Mistral) for text classification

• Training/fine-tuning your own model offers several advantages:

  • Full control over model lifecycle
  • No usage limits (except compute constraints)
  • Train once and deploy anywhere
  • Better privacy with in-house data handling
  • Often faster performance with small, specialized models

• Using pre-built model APIs has different benefits:

  • Easy setup with minimal code
  • No need to maintain compute resources
  • Access to advanced models
  • Ability to scale with usage

• API models come with drawbacks:

  • Dependency on third-party service uptime
  • Data must be sent externally
  • May have daily/periodic usage limits
  • Often slower due to API call requirements

For this project, we're going to focus on fine-tuning our own model.

Workflow we're going to follow

  • Workflow Overview: The process follows the motto "data, model, demo!" for structured machine learning development.

  • Data Preparation:

    • Create and preprocess the dataset for training.
  • Model Definition:

    • Utilize transformers.AutoModelForSequenceClassification to define a text classification model.
  • Hyperparameter Configuration:

    • Define training arguments using transformers.TrainingArguments (controls optimization, scheduling, etc.).
  • Training Setup:

    • Initialize a transformers.Trainer instance with the training arguments and dataset.
  • Model Training:

    • Execute training with Trainer.train().
  • Model Saving:

    • Save the trained model locally or push it to the Hugging Face Hub.
  • Evaluation & Testing:

    • Generate predictions on test data and analyze model performance.
  • Deployment & Demo:

    • Convert the trained model into a shareable demo.
  • Non-Linearity of ML Projects:

    • The workflow provides structure but allows flexibility for iterative experimentation.

But this workflow gives us some good guidelines to follow.
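
A rough code skeleton of the workflow above (a sketch, not a finished script: the Hub dataset path is a hypothetical placeholder, the checkpoint is an example choice, and it assumes the dataset has train/test splits of text/label pairs with integer labels):

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# 1. Data: load and tokenize a dataset (the Hub path is a placeholder).
dataset = load_dataset("your-username/food_not_food_image_captions")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

# 2. Model: define a text classification model with two labels
#    (assumes integer labels: 0 = not_food, 1 = food).
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# 3. Hyperparameters: TrainingArguments controls optimization, scheduling, etc.
training_args = TrainingArguments(output_dir="food_not_food_model")

# 7. Evaluation: accuracy metric via the evaluate library.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# 4-5. Training setup and training.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

# 6. Saving: store locally, or push to the Hub with trainer.push_to_hub().
trainer.save_model("food_not_food_model")
```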

Importing necessary libraries

  • Library Imports & Setup

    • Ensure correct setup by following the setup guide for local environments.
    • Google Colab users have most libraries pre-installed but need additional installations.
    • Enable GPU in Google Colab via Runtime ➡️ Change runtime type ➡️ Hardware accelerator ➡️ GPU.
  • Required Libraries from Hugging Face Ecosystem

    • transformers – Pre-installed on Colab; otherwise install with pip install transformers.
    • datasets – Handles dataset access and manipulation; install with pip install datasets.
    • evaluate – Provides performance evaluation metrics; install with pip install evaluate.
    • accelerate – Optimizes ML model training; install with pip install accelerate.
    • gradio – Builds interactive ML model demos; install with pip install gradio.
  • Checking Installed Versions

    • Use package_name.__version__ to verify installed package versions.
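
For example, on Google Colab:

```python
# Install the Hugging Face ecosystem libraries (uncomment on Colab).
# !pip install transformers datasets evaluate accelerate gradio

import accelerate
import datasets
import evaluate
import gradio
import transformers

# Verify installed versions with package_name.__version__
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("evaluate:", evaluate.__version__)
print("accelerate:", accelerate.__version__)
print("gradio:", gradio.__version__)
```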

Getting a dataset

  • Dataset Importance in Machine Learning

    • The dataset choice directly influences the model type and output quality.
    • High-quality datasets lead to better model performance, while poor datasets degrade model quality.
  • Text Classification Dataset Structure

    • Typically consists of text samples (e.g., sentences, paragraphs) and corresponding labels.
    • The example dataset contains synthetic image captions labeled as "food" or "not food".
  • Dataset Source

    • The example dataset was generated synthetically with a Large Language Model (covered in the next section).

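Loading such a dataset from the Hugging Face Hub and inspecting its structure might look like this (the Hub path below is a hypothetical placeholder):

```python
from datasets import load_dataset

# Hypothetical Hub path for a food/not-food caption dataset.
dataset = load_dataset("your-username/food_not_food_image_captions")

# A text classification dataset is typically text/label pairs.
print(dataset)
print(dataset["train"][0])
# e.g. {'text': 'A bowl of ramen with a soft-boiled egg', 'label': 'food'}
```
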
Food Not Food Image Caption Dataset Creation

  • Synthetic Data Generation

    • A Large Language Model (LLM) was used to generate food and non-food image captions.
    • This technique is useful for bootstrapping dataset creation when real data is limited.
    • Recommended workflow: prioritize real data and supplement with synthetic data when necessary.
  • Model Evaluation Best Practices

    • Always evaluate and test models on real-world data rather than relying solely on synthetic data.
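
Once an LLM has generated captions, they can be collected into a datasets.Dataset for training (the captions below are invented stand-ins for LLM output):

```python
from datasets import Dataset

# Invented stand-ins for LLM-generated captions.
synthetic_examples = [
    {"text": "A stack of blueberry pancakes drizzled with maple syrup", "label": "food"},
    {"text": "A mountain biker riding down a forest trail", "label": "not_food"},
]

dataset = Dataset.from_list(synthetic_examples)
print(dataset)  # Dataset({features: ['text', 'label'], num_rows: 2})
```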

Where can you get more datasets?

  • Sources for Text-Based Datasets

    • The Hugging Face Datasets Hub hosts thousands of ready-to-use datasets and can be searched programmatically (see the sketch after this list).

  • Synthetic Dataset Creation with LLMs

    • Large Language Models (LLMs) can generate synthetic data for text classification problems.
    • Enables custom dataset creation when real-world data is limited.
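
The Hugging Face Datasets Hub can be searched from code with the huggingface_hub client. A small sketch:

```python
from huggingface_hub import list_datasets

# Search the Hugging Face Datasets Hub for text classification datasets.
for dataset_info in list_datasets(search="text classification", limit=5):
    print(dataset_info.id)
```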

Setting up a model for training

  • Steps for Model Training Setup

    1. Preprocess Data – Prepare and clean dataset.
    2. Define Model – Use transformers.AutoModelForSequenceClassification.
    3. Set Training Arguments – Configure hyperparameters with transformers.TrainingArguments.
    4. Initialize Trainer – Pass TrainingArguments and dataset to transformers.Trainer.
    5. Train the Model – Call Trainer.train().
    6. Save the Model – Store locally or on Hugging Face Hub.
    7. Evaluate Performance – Make predictions and analyze test data results.
    8. Deploy Model – Create a shareable demo.
  • Using Pretrained Models for Transfer Learning

    • Start from a pretrained checkpoint and fine-tune it on your own data; transfer learning typically gives strong performance with minimal data.

  • Model Customization & Configuration

    • Use AutoModelForSequenceClassification for text classification.
    • Customize model architecture with pretrained_model_name_or_path and num_labels.
    • Adjust the classification head with transformers.PretrainedConfig to set id2label and label2id (see the sketch after this list).
  • Further Learning

    • See the full course (linked in the Acknowledgments below) for a deeper treatment of each step.
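
Putting the customization options above together, defining the model might look like this (the checkpoint name is an example choice; passing num_labels, id2label and label2id through from_pretrained updates the model's underlying PretrainedConfig):

```python
from transformers import AutoModelForSequenceClassification

# Human-readable label mappings for the classification head.
id2label = {0: "not_food", 1: "food"}
label2id = {"not_food": 0, "food": 1}

# Start from a pretrained checkpoint (transfer learning) and attach a
# fresh 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```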

Acknowledgments

The content is based on Daniel's comprehensive deep learning course Text Classification with Hugging Face Transformers and reflects his expertise in making complex deep learning concepts accessible through practical, hands-on examples.

Visit Daniel's GitHub profile for more resources on machine learning and deep learning.
