#### The first step is to install the datasets and transformers libraries (evaluate is optional; I prefer the classic metrics offered by sklearn)

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

#### The aim is to start from the dataset in CSV format (I downloaded it from Kaggle: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) and build a DatasetDict containing three datasets for training, validation and testing of the model

* NOTE: if you use Google Colab (https://colab.google/) like me, I recommend connecting it to Google Drive and storing the original dataset and/or processed versions there for reuse, because manually uploading the dataset is very slow. An alternative is to download the dataset with the command offered by Kaggle, but it has to be installed and configured again every time you start a new Colab session. In short, use the drive xD

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from datasets import load_dataset

# By default, load_dataset returns a DatasetDict containing a single Dataset called "train"
# We will split the dataset later

imdb_dataset = load_dataset("csv", data_files='/content/drive/MyDrive/IMDB Dataset.csv')

In [5]:
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 50000
    })
})

#### If you want to look at examples from the dataset, I recommend working on a sample. That way, if you try out operations to understand their effects, you do not touch the original dataset

In [None]:
imdb_sample = imdb_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
imdb_sample[:3]

#### For example, suppose we want to work with a model that expects lowercase strings: we process the dataset with an ad hoc function

In [5]:
def lowercase_function(example):
    return {"review": example["review"].lower()}

# NOTE: here map is applied to a single Dataset; if you call it on a DatasetDict, the function is applied to every dataset it contains
imdb_sample_lowercase = imdb_sample.map(lowercase_function)

print(f"Sample not processed by the lowercase function: {imdb_sample[0]}")
print(f"Sample processed by the lowercase function: {imdb_sample_lowercase[0]}")

Sample not processed by the lowercase function: {'review': 'Arguably the finest serial ever made(no argument here thus far) about Earthman Flash Gordon, Professor Zarkov, and beautiful Dale Arden traveling in a rocket ship to another universe to save the planet. Along the way, in spellbinding, spectacular, and action-packed chapters Flash and his friends along with new found friends such as Prince Barin, Prince Thun, and the awesome King Vultan pool their resources together to fight the evils and armies of the merciless Ming of Mongo and the jealous treachery of his daughter Priness Aura(now she\'s a car!). This serial is not just a cut above most serials in terms of plot, acting, and budget - it is miles ahead in these areas. Produced by Universal Studios it has many former sets at its disposable like the laboratory set from The Bride of Frankenstein and the Opera House from The Phantom of the Opera just to name a few. The production values across the board are advanced, in my most hu

#### The IMDB dataset consists of movie reviews, each labelled as positive or negative. Since the dataset is very large, we will extract a small part of it, but you are free to try with the entire dataset.

#### If you want to work with a small part of the dataset, it is a good idea to extract the same number of samples for each class. A purely random draw is not guaranteed to be balanced, and an imbalanced training set can bias the model towards the majority class.

In [6]:
print(f"Positive Sample: {imdb_dataset['train'][:]['sentiment'].count('positive')}")
print(f"Negative Sample: {imdb_dataset['train'][:]['sentiment'].count('negative')}")

Positive Sample: 25000
Negative Sample: 25000
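
#### A minimal sketch of the difference between a plain random draw and a per-class draw. It uses a toy list of labels as a stand-in for the sentiment column (the 25,000/25,000 counts come from the output above; the sample sizes and variable names are only illustrative) and needs nothing beyond the Python standard library.

```python
import random
from collections import Counter

# Toy stand-in for the sentiment column: 25,000 labels per class
labels = ["positive"] * 25000 + ["negative"] * 25000

rng = random.Random(42)  # fixed seed, for reproducibility

# Option 1: draw 5,000 examples at random from the whole dataset
random_sample = rng.sample(labels, 5000)
random_counts = Counter(random_sample)

# Option 2: draw 2,500 examples from each class separately
positives = [label for label in labels if label == "positive"]
negatives = [label for label in labels if label == "negative"]
balanced_sample = rng.sample(positives, 2500) + rng.sample(negatives, 2500)
balanced_counts = Counter(balanced_sample)

# The random draw is usually close to 2500/2500 but not guaranteed to be exact;
# the per-class draw is exactly balanced by construction
print("Random sample:  ", dict(random_counts))
print("Balanced sample:", dict(balanced_counts))
```

#### On a dataset that is already balanced, like this one, the random draw will typically be only slightly off. The per-class draw matters more when the original classes are uneven.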


In [4]:
imdb_pos = imdb_dataset['train'].filter(lambda example: example["sentiment"] == "positive")
imdb_neg = imdb_dataset['train'].filter(lambda example: example["sentiment"] == "negative")

In [8]:
print(imdb_pos)
print(imdb_neg)

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 25000
})
Dataset({
    features: ['review', 'sentiment'],
    num_rows: 25000
})


#### Now we have two separate datasets, one containing only positive reviews and another containing only negative reviews.

#### We will take a subset of each of these two new datasets and combine them into a new DatasetDict

In [5]:
# Reduce the dataset size for a fast training example

# NOTE: the use of the shuffle method with specified seed is for reproducibility

imdb_pos_sample = imdb_pos.shuffle(seed=42).select(range(2500))
imdb_neg_sample = imdb_neg.shuffle(seed=42).select(range(2500))

In [11]:
print(imdb_pos_sample)
print(imdb_neg_sample)

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 2500
})
Dataset({
    features: ['review', 'sentiment'],
    num_rows: 2500
})


In [6]:
from datasets import DatasetDict, concatenate_datasets

data_dict = {
    "train": concatenate_datasets([imdb_pos_sample, imdb_neg_sample])
}

dataset_dict = DatasetDict(data_dict)

In [13]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 5000
    })
})

#### We will use bert-base-cased from Hugging Face as a model. The first step is to tokenize the reviews (keep in mind how a Large Language Model works: it takes token ids as input, not raw text)

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

In [8]:
tokenized_dataset = dataset_dict.map(tokenize_function, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
})

#### Now that we have the input in the right format, we can remove the review column (in general, it is a good idea to remove columns that are no longer useful, even though it is possible to specify the input columns when training/evaluating the model)

In [9]:
tokenized_dataset = tokenized_dataset.remove_columns(['review'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentiment', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5000
    })
})

#### We need the label to be numeric rather than categorical

In [10]:
def encoding_function(example):
    return {"labels": 1 if example["sentiment"] == 'positive' else 0}

tokenized_dataset = tokenized_dataset.map(encoding_function)

In [18]:
# An example of encoded labels
tokenized_dataset['train'].shuffle()[:10]['labels']

[1, 0, 1, 0, 1, 1, 0, 0, 0, 1]

#### Now we can also remove the sentiment column

In [11]:
tokenized_dataset = tokenized_dataset.remove_columns(['sentiment'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
})

#### Now we can divide the dataset into train, validation and test splits

In [12]:
dataset = tokenized_dataset
dataset = dataset["train"].train_test_split(train_size=0.7, seed=42)
dataset['test'] = dataset['test'].train_test_split(train_size=0.4, seed=42)

dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3500
    })
    test: DatasetDict({
        train: Dataset({
            features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
            num_rows: 600
        })
        test: Dataset({
            features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
            num_rows: 900
        })
    })
})
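
#### The sizes above follow from applying the two fractions one after the other. Here is a small arithmetic sketch; it uses integer percentages to stay clear of floating-point rounding, and how train_test_split rounds internally is not shown here.

```python
total = 5000  # rows in the balanced dataset

# First split: 70% for training, the remaining 30% is held out
n_train = total * 70 // 100
n_holdout = total - n_train

# Second split, applied to the held-out part only: 40% validation, 60% test
n_validation = n_holdout * 40 // 100
n_test = n_holdout - n_validation

print(n_train, n_validation, n_test)

# Share of the original dataset that ends up in each split, in percent
shares = {
    "train": 100 * n_train // total,
    "validation": 100 * n_validation // total,
    "test": 100 * n_test // total,
}
print(shares)
```

#### So the second train_size=0.4 is relative to the 1,500 held-out rows: validation ends up with 12% of the original data and test with 18%.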

#### We now have the 3 datasets, but in the wrong structure (validation and test are nested inside "test"). To fix it:

In [26]:
dataset["validation"] = dataset["test"].pop("train")
dataset["test"] = dataset["test"].pop("test")

dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3500
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 900
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 600
    })
})

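#### The alternative is to extract the 3 datasets separately and create a new DatasetDict. Run this cell instead of the previous one, not after it, because it expects the nested structure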

In [13]:
dataset = DatasetDict({
    "train": dataset["train"],
    "validation": dataset["test"]["train"],
    "test": dataset["test"]["test"]
})

dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3500
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 600
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 900
    })
})

#### Now we will fine-tune the model to adapt it to our task. Training for a small number of epochs is enough to get good performance, but you can try a larger number (you trade longer training time for a possibly better model)

In [18]:
from transformers import DataCollatorWithPadding
import numpy as np

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = dataset["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
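
#### To make the role of the data collator more concrete, here is a conceptual sketch of dynamic padding in plain Python. It is not the library's implementation: pad_batch is a hypothetical helper, the token ids are made up, and a padding id of 0 is an assumption. The idea is that each batch is padded only up to the length of its own longest sequence, and the attention mask marks which positions hold real tokens.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    # Append pad_id until each sequence reaches max_len
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 for real tokens, 0 for padding positions
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids, for illustration only
batch = [[101, 1188, 102], [101, 1188, 2523, 1110, 1632, 102]]
padded = pad_batch(batch)

for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

#### Padding per batch rather than to one global length keeps the batches as short as possible, which saves computation when reviews vary a lot in length.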

#### PolynomialDecay, with its default power of 1.0, decreases the learning rate linearly with each weight update. We start with a higher learning rate and end at zero: the larger steps early on let the model make fast progress in the few epochs available, and the smaller steps later let it settle near an optimum without overshooting it. You can try different schedules.

In [20]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam

batch_size = 8
num_epochs = 3
# len(tf_train_dataset) is the number of batches per epoch (samples divided by batch size),
# so multiplying it by the number of epochs gives the total number of training steps.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
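
#### To see what the schedule does, here is a plain-Python sketch of the decay formula with power 1.0 (linear decay). The function name and the total of 1,000 steps are only illustrative assumptions; in the notebook the total is num_train_steps.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1000, power=1.0):
    """Learning rate after `step` weight updates."""
    # Past decay_steps the learning rate stays at end_lr
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Learning rate at the start, at the midpoint and at the end of training
for step in (0, 250, 500, 750, 1000):
    print(step, polynomial_decay(step))
```

#### With power=1.0 the learning rate at the midpoint is exactly half of the initial one. A power above 1 would make it drop faster at the beginning.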

#### As a model we will use classic BERT, but you can try another one. Remember that if you use another model you must use its matching tokenizer.

In [None]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

checkpoint = "bert-base-cased"

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

In [22]:
# Use num_epochs so that training matches the length of the learning rate schedule
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x78000570de10>

#### Now evaluate the model on the test set

In [23]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

tf_test_dataset = dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

y_test = dataset["test"]["labels"]

y_pred = model.predict(tf_test_dataset)["logits"]

y_pred_binary = np.argmax(y_pred, axis=1)


accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy:", accuracy)


f1 = f1_score(y_test, y_pred_binary)
print("F1-Score:", f1)


precision = precision_score(y_test, y_pred_binary)
print("Precision:", precision)


recall = recall_score(y_test, y_pred_binary)
print("Recall:", recall)


Accuracy: 0.8777777777777778
F1-Score: 0.8729792147806005
Precision: 0.8729792147806005
Recall: 0.8729792147806005
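
#### F1, precision and recall are identical here, which happens whenever the number of false positives equals the number of false negatives. The sketch below recomputes the four metrics by hand. The confusion counts are not printed by the notebook: they are inferred from the reported metrics and the 900 test examples, so treat them as an assumption.

```python
# Confusion counts consistent with the reported metrics (inferred, not from the notebook)
tp, fp, fn, tn = 378, 55, 55, 412

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
precision = tp / (tp + fp)                  # predicted positives that are truly positive
recall = tp / (tp + fn)                     # true positives that the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("F1-Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
```

#### Since fp equals fn, precision and recall share the same denominator, and the harmonic mean of two equal numbers is that same number.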


#### As we can see, with only 3 epochs on 3,500 training reviews we reach about 88% accuracy on the test set.

#### In conclusion, this is all you need to know in practice to use a Large Language Model for text classification. I recommend exploring the different models available on Hugging Face (https://huggingface.co/)