<a href="https://colab.research.google.com/github/SaketMunda/started-with-hugging-face/blob/master/hf_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-end experiment in Natural Language Processing

This notebook is an experiment for learning the following:

- HuggingFace library
- Text classification
- [Demo]()


## What we're going to build

We're going to build a model that classifies movie reviews as `positive` or `negative`.

Given a review as text, our model will predict whether the review is positive or negative, that is, whether the movie seems worth watching.

1. [Data](https://huggingface.co/datasets/stanfordnlp/imdb): Problem definition and dataset preparation
2. [Model](https://huggingface.co/distilbert/distilbert-base-uncased): Finding, training, and evaluating a model
3. [Demo](): A demo that makes the model ready for others to use, or that showcases your work.

## Workflow we're going to follow

1. Create and preprocess data
2. Define the model we'll use with `transformers.AutoModelForSequenceClassification`
3. Define training arguments (these are just hyperparameters for our model) with `transformers.TrainingArguments`
4. Pass `TrainingArguments` and target datasets to an instance of `transformers.Trainer`.
5. Train the model by calling `Trainer.train()`
6. Save the model
7. Evaluate the trained model by making predictions
8. Turn the model into a shareable demo and upload it to Hugging Face's servers


## Install dependencies

In [None]:
!pip install -U datasets evaluate accelerate gradio

In [2]:
try:
  import datasets, evaluate, accelerate
  import gradio as gr
except ModuleNotFoundError:
  !pip install -U datasets evaluate accelerate gradio # -U is for "upgrade"
  import datasets, evaluate, accelerate
  import gradio as gr

import random
import numpy as np
import pandas as pd
import torch
import transformers

In [3]:
print(f"Using pytorch version : {torch.__version__}")
print(f"Using transformers version : {transformers.__version__}")
print(f"Using datasets version : {datasets.__version__}")

Using pytorch version : 2.6.0+cu124
Using transformers version : 4.53.0
Using datasets version : 3.6.0


## Getting ready with the dataset

We will use the StanfordNLP IMDb movie review dataset for our experiment: https://huggingface.co/datasets/stanfordnlp/imdb

### Loading the dataset

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

In [4]:
# load the dataset from HuggingFace Hub
dataset = datasets.load_dataset("stanfordnlp/imdb")

dataset

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
dataset.column_names

{'train': ['text', 'label'],
 'test': ['text', 'label'],
 'unsupervised': ['text', 'label']}

In [6]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [7]:
type(dataset['train'])

In [8]:
dataset['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [9]:
# Inspect a few random samples from the training split

random_indexes = random.sample(range(len(dataset["train"])), 5)
random_samples = dataset["train"][random_indexes]

print("[INFO] Random samples from dataset: \n")
for text, label in zip(random_samples["text"], random_samples["label"]):
  print(f"Text: {text} | Label: {label}")

[INFO] Random samples from dataset: 

Text: Ahh, yes, the all-star blockbuster. Take a so-so concept, stuff it into a script and load it down with every single freakin' special effect that the Wizards of Hollyweird can conjure up, then round up the usual suspects: hot up-and-comers, has-beens, wanna-be's and never-wuzzes, and stick 'em all in ensemble roles of various sizes in front of the unforgiving eye of the cameras. And hope to gawd that some of them aren't too old to remember their lines.<br /><br />Leave it to the bishops of Box Office to apply the concept to horror films at last, as was the case with the post-EXORCIST thriller THE SENTINEL. Novelist Jeffrey Konvitz decided to try and one-up Ira Levin's ROSEMARY'S BABY scenario of creepy (and ultimately satanic) neighbors in a New York brownstone. The result was a controversial best-seller that some claimed bordered on the plagiaristic, and an equally controversial, top-heavy/star-laden vehicle co-written and directed by DEATH W

In [10]:
dataset["train"].unique("label")

[0, 1]
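
The two integer labels can be given readable names. The sketch below assumes that `0` means negative and `1` means positive, which matches the dataset card and the negative review shown at index 0 of the training split; the names `id2label`, `label2id`, and `describe` are illustrative and not part of the original notebook.

```python
# Assumption: 0 = negative, 1 = positive (as described on the dataset card).
id2label = {0: "negative", 1: "positive"}
label2id = {label: idx for idx, label in id2label.items()}

def describe(label_id: int) -> str:
    """Return the human-readable sentiment for an integer label."""
    return id2label[label_id]

print(describe(0))           # negative
print(label2id["positive"])  # 1
```

Mappings like these are also handy later, when a model's integer predictions need to be shown to people as text.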

In [11]:
movie_rating_df = pd.DataFrame(dataset["train"])
movie_rating_df.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


In [12]:
movie_rating_df["label"].value_counts()

| label | count |
|---|---|
| 0 | 12500 |
| 1 | 12500 |


### Preparing data for text classification

1. **Tokenization** - turning text into numbers
2. **Train/test split** - splitting the data so that the model trains on the training set and is evaluated on the test set.
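
The IMDb dataset already ships with `train` and `test` splits, so no extra split is needed here. For a dataset that comes as a single table, the idea behind a seeded split can be sketched in plain Python. This is a toy illustration, not the `datasets` implementation; the helper `train_test_split` and the rows in `toy_rows` are made up for the example.

```python
import random

def train_test_split(rows, test_size=0.2, seed=17):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data standing in for real reviews
toy_rows = [{"text": f"review {i}", "label": i % 2} for i in range(10)]
train_rows, test_rows = train_test_split(toy_rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed means the same rows land in the test set on every run, which keeps evaluation numbers comparable between experiments.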



In [14]:
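# The dataset already ships with train/test splits, so no extra split is needed.
# Note: train_test_split is a method of a single Dataset, not of a DatasetDict, e.g.
# dataset["train"].train_test_split(test_size=0.2, seed=17)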
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [19]:
dataset["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

#### Tokenizing the text data

`"I love sci-fi" -> {"I": 0, "love": 1, "sci-fi": 2}`
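
The idea of mapping text to integers can be sketched with a toy whitespace tokenizer. This is only an illustration: the real tokenizer loaded below uses a fixed subword vocabulary rather than whole words, and the helpers `build_vocab` and `encode` are hypothetical names for this example.

```python
def build_vocab(texts):
    """Assign each new whitespace-separated word the next free integer ID."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a text into a list of integer IDs using the vocabulary."""
    return [vocab[word] for word in text.split()]

vocab = build_vocab(["I love sci-fi"])
print(vocab)                           # {'I': 0, 'love': 1, 'sci-fi': 2}
print(encode("I love sci-fi", vocab))  # [0, 1, 2]
```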

In [21]:
from transformers import AutoTokenizer

model_path = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_path,
                                          use_fast=True)

tokenizer

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [25]:
tokenizer("I love sci-fi!")

{'input_ids': [101, 1045, 2293, 16596, 1011, 10882, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

`input_ids` are the token IDs; `attention_mask` is `1` for tokens the model should attend to and `0` for padding tokens it should ignore.
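
Zeros only show up in the mask once sequences of different lengths are padded to a common length. The sketch below pads two ID lists on the right with the `[PAD]` ID `0` (both taken from the tokenizer output above) and builds the matching masks. The helper `pad_batch` is a hypothetical stand-in written for this example, not the tokenizer's own code.

```python
PAD_ID = 0  # [PAD] has ID 0 in this tokenizer's vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2293, 16596, 1011, 10882, 999, 102],  # "I love sci-fi!"
    [101, 1998, 102],                                  # "and"
])
print(batch["input_ids"][1])       # [101, 1998, 102, 0, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0, 0]
```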

In [30]:
print(f"Length of tokenizer vocabulary: {len(tokenizer.vocab)}")

Length of tokenizer vocabulary: 30522


In [31]:
print(f"Max tokenizer input sequence length: {tokenizer.model_max_length}")

Max tokenizer input sequence length: 512
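
Many reviews are longer than 512 tokens, so they have to be cut down to fit. A simplified sketch of right-side truncation for a single sequence follows: keep the first `max_length - 2` token IDs and wrap them in `[CLS]` (101) and `[SEP]` (102). The function `truncate_with_specials` and the fake IDs in `long_review` are assumptions made for illustration only.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs shown in the tokenizer output above

def truncate_with_specials(token_ids, max_length=512):
    """Truncate raw token IDs so that [CLS] + tokens + [SEP] fits in max_length."""
    body = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [CLS_ID] + body + [SEP_ID]

long_review = list(range(1000, 1600))  # hypothetical 600 token IDs
encoded = truncate_with_specials(long_review)
print(len(encoded))             # 512
print(encoded[0], encoded[-1])  # 101 102
```

Anything past the limit is simply dropped, so the model never sees the end of a very long review.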


In [35]:
tokenizer.vocab["and"]

1998

In [36]:
tokenizer("and")

{'input_ids': [101, 1998, 102], 'attention_mask': [1, 1, 1]}

In [37]:
tokenizer.convert_ids_to_tokens(tokenizer("and").input_ids)

['[CLS]', 'and', '[SEP]']

In [38]:
tokenizer.convert_ids_to_tokens(tokenizer("I love sci-fi!").input_ids)

['[CLS]', 'i', 'love', 'sci', '-', 'fi', '!', '[SEP]']

In [39]:
tokenizer.convert_ids_to_tokens(tokenizer("sneh").input_ids)

['[CLS]', 's', '##ne', '##h', '[SEP]']

In [40]:
tokenizer.convert_ids_to_tokens(tokenizer("💟").input_ids)

['[CLS]', '[UNK]', '[SEP]']
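
The outputs above show WordPiece-style subword splitting: a word that is not in the vocabulary is broken into known pieces, continuation pieces are prefixed with `##`, and a word that cannot be split into known pieces becomes `[UNK]`. A minimal sketch of greedy longest-match-first splitting is shown below. The tiny `toy_vocab` is made up for the example (the real vocabulary has 30,522 entries), and the real tokenizer also lowercases and splits on punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"s", "##ne", "##h", "love"}  # hypothetical mini vocabulary
print(wordpiece("sneh", toy_vocab))  # ['s', '##ne', '##h']
print(wordpiece("love", toy_vocab))  # ['love']
print(wordpiece("💟", toy_vocab))    # ['[UNK]']
```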

In [46]:
tokenizer("[PAD]")

{'input_ids': [101, 0, 102], 'attention_mask': [1, 1, 1]}

In [41]:
def tokenize_text(examples):
  """
  Tokenize a batch of texts, padding to the longest text in the batch and truncating to the model's max length.
  """
  return tokenizer(examples["text"],
                   padding=True,
                   truncation=True)

In [42]:
tokenized_dataset = dataset.map(function=tokenize_text,
                                batched=True,
                                batch_size=1000)

tokenized_dataset

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
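
With `batched=True`, the mapped function receives a dictionary of lists (one batch at a time) rather than a single example, and the columns it returns are added to the dataset. A side effect worth knowing: `padding=True` pads to the longest text within each batch, so rows from different batches can end up with different lengths. The plain-Python sketch below imitates that behaviour on toy data; `batched_map` and `toy_tokenize` are hypothetical helpers, not the `datasets` implementation.

```python
def batched_map(rows, function, batch_size):
    """Apply function to column-wise batches and merge its output into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of row dicts into a dict of column lists
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        merged = {**batch, **function(batch)}
        for i in range(len(chunk)):
            out.append({key: values[i] for key, values in merged.items()})
    return out

def toy_tokenize(examples):
    """Split on whitespace and pad to the longest text in this batch only."""
    token_lists = [text.split() for text in examples["text"]]
    max_len = max(len(tokens) for tokens in token_lists)
    return {"tokens": [t + ["[PAD]"] * (max_len - len(t)) for t in token_lists]}

rows = [{"text": "great"}, {"text": "not great at all"},
        {"text": "fine"}, {"text": "so bad"}]
mapped = batched_map(rows, toy_tokenize, batch_size=2)
print([len(row["tokens"]) for row in mapped])  # [4, 4, 2, 2]
```

The first batch is padded to 4 tokens and the second to 2, which mirrors how padding length depends on the batch a row falls into.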

In [43]:
# Get the first sample from the tokenized train and test splits
train_tokenized_sample = tokenized_dataset["train"][0]
test_tokenized_sample = tokenized_dataset["test"][0]

for key in train_tokenized_sample.keys():
  print(f"[INFO] Key: {key}")
  print(f"Train sample: {train_tokenized_sample[key]}")
  print(f"Test Sample: {test_tokenized_sample[key]}")
  print("")

[INFO] Key: text
Train sample: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scen