## Checking if a GPU is available and selecting device accordingly

In [5]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} for training.")

Using cuda for training.


# 1. Importing necessary libraries

In [7]:
import numpy as np
import pandas as pd
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import f1_score, precision_score, recall_score

# 2. Loading the dataset

In [9]:
ds = load_dataset('imdb')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [10]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

# 3. Data Preparation:

In [11]:
ds['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [12]:
ds_train = pd.DataFrame(ds['train'])
ds_test = pd.DataFrame(ds['test'])

### This dataset contains text reviews from IMDb, each labeled as thumbs down or thumbs up (mapped to labels 0 and 1)

In [13]:
ds_train.head()

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


In [14]:
ds_train['label'].value_counts()

label
0    12500
1    12500
Name: count, dtype: int64

In [15]:
train = Dataset.from_pandas(ds_train)
test = Dataset.from_pandas(ds_test)
# reconstructing both datasets into a Dataset Dict object
new_ds = DatasetDict(
    {
        'train': train,
        'test': test
    }
)
# viewing the resulting dataset dict object
new_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})

In [12]:
ds = new_ds

# 4. Initializing Tokenizer


## The Tokenizer will:

### Convert the text into numerical representations
### Add special tokens to indicate the beginning and end of sentences (CLS and SEP)


### After the text is tokenized and converted into tensors, every sequence must have the same length, because the model expects uniformly shaped inputs. To ensure this, we pad the shorter sequences and truncate the longer ones.

### Since the DatasetDict holds the train and test splits separately, we can tokenize both at once by mapping over `ds`.
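
The padding and truncation described above can be sketched in plain Python. This is only a toy illustration, not the Hugging Face implementation: `pad_or_truncate`, `PAD_ID`, and the token IDs are made up for the example, and a real tokenizer also keeps the closing [SEP] token when it truncates.

```python
PAD_ID = 0  # assumed padding ID for this sketch


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncate sequences that are too long
    n_pad = max_length - len(kept)                   # number of padding slots still needed
    input_ids = kept + [pad_id] * n_pad              # pad sequences that are too short
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs: one sequence shorter and one longer than max_length
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the input IDs.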

In [16]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # Mapping function applied to each batch of examples.
    # Pad and truncate so every sequence has exactly max_length tokens.
    # return_tensors is omitted: .map() stores plain lists, and the Trainer
    # builds the tensors when it assembles batches.
    return tokenizer(examples["text"], padding="max_length", truncation=True,
                     max_length=512)

# Apply to every split with .map(), a built-in method of HF Dataset and DatasetDict objects
tokenized_datasets = ds.map(tokenize_function, batched=True)


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

# 6. Loading and choosing the model (here DistilBERT)

### -BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, it represents each word using the context on both its left and its right.
###          DistilBERT is a smaller, faster, cheaper, and lighter version of BERT

### -This dataset contains text reviews from IMDb, each labeled as thumbs down or thumbs up (num_labels is 2, mapped to labels 0 and 1)
### -A classification layer is added on top of DistilBERT, since sentiment analysis is a classification task.
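
As a rough sketch of what a classification layer does, the code below maps a hidden vector to two logits with a linear layer and turns them into probabilities with softmax. All numbers, sizes, and helper names (`linear`, `softmax`, `cls_vector`) are invented for illustration. The real DistilBERT head is larger and also has a `pre_classifier` layer before the final `classifier`.

```python
import math


def linear(x, weights, bias):
    """Compute weights @ x + bias for a small dense layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Turn logits into probabilities that sum to 1."""
    m = max(z)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


cls_vector = [0.5, -1.0, 2.0]               # made-up hidden state (real size: 768)
W = [[0.1, 0.2, -0.3],                      # one row of weights per label
     [-0.2, 0.1, 0.4]]
b = [0.0, 0.1]

logits = linear(cls_vector, W, b)           # one score per label (0 and 1)
probs = softmax(logits)
predicted_label = probs.index(max(probs))
print(logits, probs, predicted_label)
```

These weights are the part that starts out randomly initialized, which is why the model has to be fine-tuned before its predictions mean anything.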


The input IDs are the tokenized form of the sentence: each token (a word or a piece of a word) is replaced by its integer ID in the vocabulary. The tokenized sequence may also include special tokens such as [CLS], [SEP], and padding tokens.

1-[CLS] (Classification Token):
[CLS] stands for "classification". It is placed at the beginning of every input sequence, and the model's output at that position serves as a summary of the whole sequence for classification.

2-[SEP] (Separator Token):
[SEP] stands for "separator". It marks the end of a sequence and separates the two sequences in tasks that involve multiple text inputs, such as sentence-pair classification or question answering.

 example = " [CLS] This movie was fantastic! [SEP] "
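
The placement of the special tokens can be sketched as follows. This is a simplified illustration: `add_special_tokens` is a hypothetical helper, and whitespace splitting stands in for the real subword tokenizer.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT-style [CLS] / [SEP] markers."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        # A second sequence gets its own closing [SEP]
        tokens += tokens_b + ["[SEP]"]
    return tokens


# Single review, as in sentiment classification
single = add_special_tokens("this movie was fantastic !".split())

# Sentence pair, as in question answering
pair = add_special_tokens("was it good ?".split(), "yes , fantastic".split())

print(single)
print(pair)
```

For this notebook only the single-sequence form matters, since each review is classified on its own.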

In [17]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In each epoch, the model iterates through all the batches, computes the loss for each batch, and updates the model's parameters based on the gradients of the loss function with respect to those parameters.

After completing one epoch, the model has seen and been trained on each example in the training dataset once. The number of epochs is a hyperparameter that you specify before training begins, indicating how many times the model should iterate through the entire dataset.
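
To make the relationship between examples, batches, and epochs concrete, here is a small calculation using this notebook's values (25,000 training reviews, batch size 16, one epoch). It assumes a single device and no gradient accumulation.

```python
import math

num_examples = 25_000   # size of the IMDb training split
batch_size = 16         # per_device_train_batch_size used in this notebook
num_epochs = 1          # num_train_epochs used in this notebook

# The last batch is smaller than the others, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 1563 1563
```

Under those assumptions the result matches the `global_step=1563` reported by `trainer.train()` in this notebook.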

# 7. Setting up Training Arguments



By default, the Trainer does not evaluate the model during training, and the loss it optimizes is not accuracy but cross-entropy, which is hard to interpret on its own. We therefore need to pass the Trainer a metrics function so that evaluation reports results we can actually read.

In [4]:
training_args = TrainingArguments(
    output_dir='/content/my_model',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    push_to_hub=False
)

# 8. Defining Metrics for Evaluation

In [19]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }
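
To see what these metrics measure, the same quantities can be worked out by hand on a tiny example. The logits and labels below are made up, and the sketch uses plain Python instead of NumPy and scikit-learn so that each step is visible.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


# Made-up logits for six reviews: [score for label 0, score for label 1]
logits = [[0.2, 1.5], [2.0, -1.0], [0.9, 0.1], [-0.3, 0.8], [0.1, 0.4], [-1.2, 2.2]]
labels = [1, 0, 1, 1, 0, 1]

predictions = [argmax(row) for row in logits]   # [1, 0, 0, 1, 1, 1]

# Confusion counts, treating label 1 as the positive class
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With `average='binary'`, scikit-learn reports precision, recall, and F1 for the positive class (label 1), which is what the counts above reproduce.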

In [20]:
train = train.map(tokenize_function, batched=True)
test = test.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [21]:
new_ds = DatasetDict({
    'train': train,
    'test': test
})

In [22]:
train_dataset = new_ds["train"].shuffle(seed=42)
eval_dataset = new_ds["test"].shuffle(seed=42)

# 9. Setting Up the Trainer

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500 | 0.3126 |
| 1000 | 0.2436 |
| 1500 | 0.2275 |


TrainOutput(global_step=1563, training_loss=0.25772921061256493, metrics={'train_runtime': 1182.6386, 'train_samples_per_second': 21.139, 'train_steps_per_second': 1.322, 'total_flos': 3311684966400000.0, 'train_loss': 0.25772921061256493, 'epoch': 1.0})

In [27]:
evaluation_results = trainer.evaluate()
evaluation_results

{'eval_loss': 0.19659632444381714,
 'eval_accuracy': 0.92632,
 'eval_f1': 0.9271936758893281,
 'eval_precision': 0.916328125,
 'eval_recall': 0.93832,
 'eval_runtime': 411.6614,
 'eval_samples_per_second': 60.73,
 'eval_steps_per_second': 3.797,
 'epoch': 1.0}

# 10. Saving the model to the Hugging Face Hub

In [28]:
from huggingface_hub import notebook_login
notebook_login()


We need git-lfs because the model weights and other assets are large files.

In [29]:
!apt-get install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [31]:
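repo_name = "LLM_project"
# Recreate the training arguments with push_to_hub=True so that saved models
# are uploaded to a Hub repository named after output_dir. These arguments
# only take effect once a Trainer is built with them.
training_args = TrainingArguments(
    output_dir=repo_name,
    push_to_hub=True,
)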