## ALBERT + Publish a Custom Dataset to the Hugging Face Hub

In [1]:
!pip install -U transformers -q
!pip install accelerate -U -q
!pip install datasets -q

## Curse of Dimensionality

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/nlp/albert01.png" width=1000>

# ALBERT (A Lite BERT)

ALBERT is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model, designed to be more efficient while maintaining performance. Key features and differences include:

- Parameter Reduction: ALBERT reduces the number of parameters compared to BERT by sharing parameters across layers and decomposing the embedding matrix into two smaller matrices. This makes ALBERT more efficient in terms of memory usage and computational cost.

- Architecture: Like BERT, ALBERT also uses a transformer architecture but with optimizations to improve efficiency and scalability.

- Techniques: ALBERT employs two key parameter-reduction techniques:

  - Factorized Embedding Parameterization: The large vocabulary embedding matrix is split into two smaller matrices, so the embedding size no longer has to match the hidden size.

  - Cross-Layer Parameter Sharing: The same weights are reused across the layers of the model, significantly reducing the number of parameters.

- Performance: Despite having fewer parameters, ALBERT achieves competitive performance on various natural language understanding tasks, such as those in the GLUE benchmark and SQuAD (Stanford Question Answering Dataset).

- Use Cases: ALBERT is particularly suited for tasks requiring efficient resource usage without compromising much on performance, such as large-scale deployment scenarios.
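
To make the two parameter-reduction ideas concrete, the following sketch counts parameters by hand. The sizes are assumptions chosen for illustration (a 30,000-token vocabulary, hidden size 768, embedding size 128, 12 layers), and `params_per_layer` is a made-up round number, not a measured value.

```python
# Rough parameter arithmetic for ALBERT's two reduction techniques.
# All sizes below are illustrative assumptions, not values read from a checkpoint.
VOCAB_SIZE = 30_000    # V
HIDDEN_SIZE = 768      # H
EMBEDDING_SIZE = 128   # E, the small intermediate embedding width
NUM_LAYERS = 12

# 1) Factorized embedding parameterization: V*H becomes V*E + E*H.
full_embedding = VOCAB_SIZE * HIDDEN_SIZE
factorized_embedding = VOCAB_SIZE * EMBEDDING_SIZE + EMBEDDING_SIZE * HIDDEN_SIZE

# 2) Cross-layer parameter sharing: one set of layer weights reused by every layer.
params_per_layer = 7_000_000  # hypothetical round figure for one transformer layer
unshared_layers = NUM_LAYERS * params_per_layer
shared_layers = params_per_layer

print(f"embedding: {full_embedding:,} -> {factorized_embedding:,}")
print(f"encoder layers: {unshared_layers:,} -> {shared_layers:,}")
```

Under these assumptions the embedding table shrinks by a factor of almost six, and the encoder stack by a factor equal to the number of layers.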


<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/nlp/albert00.png" width=1000>

<img src= "https://frenzy86.s3.eu-west-2.amazonaws.com/python/nlp/albert02.png" width=800>

In [2]:
# GPT-3.5 (assumed ~20 billion parameters) vs. DistilBERT (~66 million)

20000000000/66000000  # DistilBERT is roughly 300 times smaller than GPT-3.5

303.030303030303

In [3]:
# GPT-3.5 (assumed ~20 billion parameters) vs. ALBERT base (~11 million)

20000000000/11000000  # ALBERT base is roughly 1,800 times smaller than GPT-3.5

1818.1818181818182
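
As a sanity check on the ~11 million figure, the parameter count can be estimated from the checkpoint size. This sketch assumes the `model.safetensors` size of 46.7 MB reported in the upload log later in this notebook, 4 bytes per weight (float32), and negligible file overhead; all three are assumptions.

```python
# Estimate the parameter count from the checkpoint size (assumes float32 weights).
checkpoint_bytes = 46.7e6   # size shown in the upload log; "M" read as 10**6 bytes
bytes_per_param = 4         # float32; would be 2 for float16/bfloat16

estimated_params = checkpoint_bytes / bytes_per_param
print(f"~{estimated_params / 1e6:.1f}M parameters")
```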

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [5]:
!wget 'https://frenzy86.s3.eu-west-2.amazonaws.com/python/data/textsentiment.csv'

--2024-09-30 09:44:29--  https://frenzy86.s3.eu-west-2.amazonaws.com/python/data/textsentiment.csv
Resolving frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)... 52.95.143.34, 3.5.245.19, 3.5.246.13, ...
Connecting to frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)|52.95.143.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1942226 (1.9M) [text/csv]
Saving to: ‘textsentiment.csv’


2024-09-30 09:44:31 (1.62 MB/s) - ‘textsentiment.csv’ saved [1942226/1942226]



In [6]:
df = pd.read_csv('textsentiment.csv')
df.rename(columns={'sentiment':'label'},inplace=True)
df = df[['text','label']]
df

Unnamed: 0,text,label
0,Sooo SAD I will miss you here in San Diego!!!,negative
1,my boss is bullying me...,negative
2,what interview! leave me alone,negative
3,"Sons of ****, why couldn`t they put them on t...",negative
4,2am feedings for the baby are fun when he is a...,positive
...,...,...
16358,enjoy ur night,positive
16359,wish we could come see u on Denver husband l...,negative
16360,I`ve wondered about rake to. The client has ...,negative
16361,Yay good for both of you. Enjoy the break - y...,positive


In [7]:
## How to create a dataset for HF standard
from datasets import Dataset,DatasetDict

def create_dataset_splits(df, train_size=0.8, validation_size=0.1, test_size=0.1, seed=667):
    """
    Creates train, validation, and test splits from the given DataFrame
    Returns:
        DatasetDict: A DatasetDict containing 'train', 'validation', and 'test' splits.
    """
    # Compare with a tolerance: sums of floats may not be exactly 1
    assert abs(train_size + validation_size + test_size - 1) < 1e-9, "The sum of train_size, validation_size, and test_size must be 1."

    dataset = Dataset.from_pandas(df)
    # First split: training rows vs. everything held out
    ds_train_devtest = dataset.train_test_split(test_size=(1 - train_size), seed=seed)
    # Second split: share of the held-out rows that goes to the test set
    test_fraction = test_size / (validation_size + test_size)
    ds_devtest = ds_train_devtest['test'].train_test_split(test_size=test_fraction, seed=seed)
    dataset_dict = DatasetDict({
                                'train': ds_train_devtest['train'],
                                'validation': ds_devtest['train'],
                                'test': ds_devtest['test']
                                })
    return dataset_dict

dataset = create_dataset_splits(df)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13090
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1637
    })
})
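
The split sizes above follow from the two nested calls. The sketch below reproduces them with plain arithmetic, assuming the test share of each split is rounded up (an assumption about the rounding rule that matches the counts shown) and taking the total of 16,363 rows from the sum of the three splits.

```python
import math

def nested_split_sizes(n_rows, train_size=0.8, validation_size=0.1, test_size=0.1):
    """Return (train, validation, test) row counts for a two-stage split."""
    # Stage 1: hold out everything that is not training data.
    held_out = math.ceil((1 - train_size) * n_rows)
    n_train = n_rows - held_out
    # Stage 2: divide the held-out rows between validation and test.
    test_fraction = test_size / (validation_size + test_size)
    n_test = math.ceil(test_fraction * held_out)
    n_validation = held_out - n_test
    return n_train, n_validation, n_test

print(nested_split_sizes(13090 + 1636 + 1637))
```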

## Push dataset to the Hugging Face Hub

In [8]:
# login - remember to get your token from the Hugging Face hub
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

In [9]:
dataset.push_to_hub("Frenz/sentimente_test")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/14 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/567 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/Frenz/sentimente_test/commit/33e64fbbcab66210b02b3f0006fd0567c46abf2b', commit_message='Upload dataset', commit_description='', oid='33e64fbbcab66210b02b3f0006fd0567c46abf2b', pr_url=None, pr_revision=None, pr_num=None)

In [10]:
from datasets import load_dataset, DatasetDict

dataset = load_dataset("Frenz/sentimente_test")
dataset

train-00000-of-00001.parquet:   0%|          | 0.00/741k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/93.1k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/94.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13090 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1636 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1637 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13090
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1637
    })
})

The dataset has three splits: train, validation, and test.

#### Label Encoder

In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(dataset['train']['label'])

def encode_labels(example):
    return {'encoded_label': label_encoder.transform([example['label']])[0]}

for split in dataset:
    dataset[split] = dataset[split].map(encode_labels, batched=False)

Map:   0%|          | 0/13090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1636 [00:00<?, ? examples/s]

Map:   0%|          | 0/1637 [00:00<?, ? examples/s]

The next cells store the label names in the model config, so that inference returns the actual label names rather than their numeric IDs.

In [12]:
model_name = "albert/albert-base-v2"  #basemodel Albert
your_path = "modelsent_test"

In [13]:
from transformers import AutoConfig

unique_labels = sorted(list(set(dataset['train']['label'])))
id2label = {i: label for i, label in enumerate(unique_labels)}
label2id = {label: i for i, label in enumerate(unique_labels)}

config = AutoConfig.from_pretrained(model_name)
config.id2label = id2label
config.label2id = label2id

# Verify the correct labels
print("ID to Label Mapping:", config.id2label)
print("Label to ID Mapping:", config.label2id)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

ID to Label Mapping: {0: 'negative', 1: 'positive'}
Label to ID Mapping: {'negative': 0, 'positive': 1}
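
These integer IDs line up with the `encoded_label` column created earlier because both approaches order the class names alphabetically (scikit-learn's `LabelEncoder` sorts its classes). A minimal pure-Python sketch of that round trip, using a made-up list of labels:

```python
# Toy labels; the real ones come from dataset['train']['label'].
labels = ["negative", "positive", "positive", "negative", "positive"]

# Sorted unique names define the ID order, as in the cell above.
unique_labels = sorted(set(labels))
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}

encoded = [label2id[label] for label in labels]   # names -> integers
decoded = [id2label[i] for i in encoded]          # integers -> names

print(encoded)
print(decoded == labels)
```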


As with other models such as BERT, DistilBERT, or RoBERTa, you can use `AutoTokenizer` and `AutoModelForSequenceClassification`, which automatically select the correct classes for the specified model. The cell below uses the ALBERT-specific classes explicitly instead.

In [14]:
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained(model_name)
model = AlbertForSequenceClassification.from_pretrained(model_name, config=config)


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# This next function filters for invalid content and then makes sure the text data is
# properly tokenized and labeled, preparing the dataset for training.
def filter_invalid_content(example):
    return isinstance(example['text'], str)

dataset = dataset.filter(filter_invalid_content, batched=False)

def encode_data(batch):
    tokenized_inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=256)
    tokenized_inputs["labels"] = batch["encoded_label"]
    return tokenized_inputs

dataset_encoded = dataset.map(encode_data, batched=True)
dataset_encoded

Filter:   0%|          | 0/13090 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1636 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1637 [00:00<?, ? examples/s]

Map:   0%|          | 0/13090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1636 [00:00<?, ? examples/s]

Map:   0%|          | 0/1637 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 13090
    })
    validation: Dataset({
        features: ['text', 'label', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['text', 'label', 'encoded_label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1637
    })
})
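
What `padding=True` and `truncation=True` do to a batch can be illustrated with toy token IDs. This is a simplified sketch, not the tokenizer's actual implementation: the token IDs are invented, special tokens are ignored, and `pad_id=0` is chosen to match the `<pad>` token ID of this tokenizer.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[5, 17, 42], [8, 9, 10, 11, 12, 13], [7]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```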

In [16]:
dataset_encoded.set_format(type='torch',
                           columns=['input_ids', 'attention_mask', 'labels']
                           )

In [17]:
#We also need to fetch a data collator to handle padding for our inputs.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=AlbertTokenizer(name_or_path='albert/albert-base-v2', vocab_size=30000, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '<unk>', 'sep_token': '[SEP]', 'pad_token': '<pad>', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=Non

In [18]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np

label_encoder = LabelEncoder()
label_encoder.fit(unique_labels)

def per_label_accuracy(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    correct_predictions = cm.diagonal()
    label_totals = cm.sum(axis=1)
    per_label_acc = np.divide(correct_predictions, label_totals, out=np.zeros_like(correct_predictions, dtype=float), where=label_totals != 0)
    return dict(zip(labels, per_label_acc))
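
Per-label accuracy as defined here is the fraction of examples of each true class that were predicted correctly, i.e. the per-class recall. A small worked example in plain Python, with invented labels and predictions and a hypothetical helper name:

```python
from collections import Counter

def per_label_accuracy_simple(y_true, y_pred):
    """Fraction of examples of each true label that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

y_true = ["negative", "negative", "negative", "positive", "positive"]
y_pred = ["negative", "positive", "negative", "positive", "positive"]
result = per_label_accuracy_simple(y_true, y_pred)
print(result)
```

Here two of the three negative examples and both positive examples are classified correctly.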

In [19]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    decoded_labels = label_encoder.inverse_transform(labels)
    decoded_preds = label_encoder.inverse_transform(preds)

    precision = precision_score(decoded_labels, decoded_preds, average='weighted')
    recall = recall_score(decoded_labels, decoded_preds, average='weighted')
    f1 = f1_score(decoded_labels, decoded_preds, average='weighted')
    acc = accuracy_score(decoded_labels, decoded_preds)

    labels_list = list(label_encoder.classes_)
    per_label_acc = per_label_accuracy(decoded_labels, decoded_preds, labels_list)

    per_label_acc_metrics = {}
    for label, accuracy in per_label_acc.items():
        label_key = f"accuracy_label_{label}"
        per_label_acc_metrics[label_key] = accuracy

    return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall,
            **per_label_acc_metrics
            }
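
One property of the weighted averages used above is worth knowing: support-weighted recall always equals plain accuracy, which is why the Recall and Accuracy columns in the training log further down are identical. A hedged pure-Python check on invented data:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

The class weights cancel the per-class denominators, so the sum reduces to total correct predictions divided by total examples.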

## Training the Model


In [20]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
                                output_dir=your_path,
                                num_train_epochs=3,
                                warmup_steps=500,
                                per_device_train_batch_size=16,
                                per_device_eval_batch_size=16,
                                weight_decay=0.01,
                                logging_steps=10,
                                eval_strategy='steps',  # named evaluation_strategy before transformers 4.41
                                eval_steps=100,
                                learning_rate=2e-5,
                                save_steps=1000,
                                gradient_accumulation_steps=2
                                )
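
With `warmup_steps=500` and the Trainer's default linear schedule, the learning rate ramps up from 0 to `2e-5` over the first 500 optimizer steps and then decays linearly to 0. The sketch below reproduces that shape; `total_steps=1227` is taken from the `global_step` reported by this run, and the function is an illustration rather than the library's code.

```python
def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=1227):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 1000, 1227):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

Note that the warmup covers roughly 40% of this short run, so the model spends only the remaining steps at or below the peak learning rate.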

In [21]:
trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=dataset_encoded['train'],
                eval_dataset=dataset_encoded['test'],  # this run evaluates on the test split; 'validation' is the usual choice during training
                compute_metrics=compute_metrics,
                tokenizer=tokenizer,
                data_collator=data_collator,
                )
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall,Accuracy Label Negative,Accuracy Label Positive
100,0.6006,0.557237,0.730605,0.730671,0.731331,0.730605,0.742424,0.719527
200,0.2819,0.319351,0.883934,0.883717,0.89042,0.883934,0.944444,0.827219
300,0.2392,0.242424,0.91631,0.916236,0.916739,0.91631,0.895202,0.936095
400,0.2603,0.226156,0.915699,0.915711,0.915766,0.915699,0.917929,0.913609
500,0.1964,0.339567,0.895541,0.895248,0.904212,0.895541,0.965909,0.829586
600,0.208,0.233709,0.911423,0.911337,0.911907,0.911423,0.888889,0.932544
700,0.2591,0.229377,0.919976,0.919991,0.920123,0.919976,0.925505,0.914793
800,0.0957,0.21467,0.924252,0.924267,0.924418,0.924252,0.930556,0.918343
900,0.1372,0.253633,0.926084,0.926097,0.926193,0.926084,0.930556,0.921893
1000,0.0782,0.288536,0.917532,0.917508,0.920384,0.917532,0.955808,0.881657


TrainOutput(global_step=1227, training_loss=0.23223331961674523, metrics={'train_runtime': 651.3866, 'train_samples_per_second': 60.287, 'train_steps_per_second': 1.884, 'total_flos': 157212915612480.0, 'train_loss': 0.23223331961674523, 'epoch': 2.9963369963369964})
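
The `global_step=1227` and `epoch≈2.996` values above can be reproduced from the training arguments. The sketch assumes a single device and that an incomplete group of accumulated batches at the end of an epoch does not trigger an extra optimizer step (an assumption that matches this run's numbers).

```python
import math

train_rows = 13090
per_device_batch = 16
grad_accum = 2
epochs = 3

batches_per_epoch = math.ceil(train_rows / per_device_batch)  # forward/backward passes per epoch
updates_per_epoch = batches_per_epoch // grad_accum           # optimizer steps per epoch
total_updates = updates_per_epoch * epochs
effective_batch = per_device_batch * grad_accum               # examples per optimizer step
epochs_completed = total_updates * grad_accum / batches_per_epoch

print(total_updates, effective_batch, round(epochs_completed, 4))
```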

## Evaluating the Model

In [None]:
trainer.evaluate()
trainer.save_model(your_path)
trainer.save_state()

## Run it locally with a Hugging Face pipeline

In [None]:
from transformers import pipeline

pipeline_name = 'text-classification'
pipe = pipeline(pipeline_name, model=your_path)

In [None]:
example_titles = [
                "grab an example title",
                "grab another example title",
                "and another xample title",
                ]
for title in example_titles:
    result = pipe(title)
    print(f"Title: {title}")
    print(f"Output: {result[0]['label']}")

Title: grab an example title
Output: positive
Title: grab another example title
Output: positive
Title: and another xample title
Output: positive
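
Conceptually, the text-classification pipeline turns the model's two output logits into a label and a score by applying a softmax and looking up the winning index in `id2label`. A minimal sketch with invented logits (the real values come from the model); `classify` is a hypothetical helper, not part of the library:

```python
import math

id2label = {0: "negative", 1: "positive"}  # mapping stored in the model config

def classify(logits):
    """Softmax over logits, then return the top label and its probability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

prediction = classify([-1.2, 2.3])  # hypothetical logits
print(prediction)
```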


## Push Model to the Hugging Face Hub


In [None]:
#!huggingface-cli login

In [None]:
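tokenizer.push_to_hub("Frenz/modelsent_test")
# Trainer.push_to_hub takes a commit message as its first argument, so the string
# below becomes the commit message; the target repo is derived from output_dir
# (or hub_model_id, if set).
trainer.push_to_hub("Frenz/modelsent_test")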

README.md:   0%|          | 0.00/3.45k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.7M [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

events.out.tfevents.1718536095.e074026fd21f.458.1:   0%|          | 0.00/694 [00:00<?, ?B/s]

events.out.tfevents.1718535344.e074026fd21f.458.0:   0%|          | 0.00/38.4k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.05k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Frenz/modelsent_test/commit/25702ffc51c6ab0516c57a3badc5a82b9911f417', commit_message='Frenz/modelsent_test', commit_description='', oid='25702ffc51c6ab0516c57a3badc5a82b9911f417', pr_url=None, pr_revision=None, pr_num=None)

## Load from HF

In [None]:
from transformers import pipeline

model_id = "Frenz/modelsent_test"
pipe = pipeline("text-classification", model=model_id)

config.json:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/46.7M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.27M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
example_titles = [
                "grab an example title",
                "I hate you",
                "and another xample title"
                ]

for title in example_titles:
    result = pipe(title)
    print(f"Title: {title}")
    print(f"Output: {result[0]['label']}")

Title: grab an example title
Output: positive
Title: I hate you
Output: negative
Title: and another xample title
Output: positive
