<a href="https://colab.research.google.com/github/Kojo7/MT-Preparation/blob/main/ML_NLP_IMDB_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install comet_ml torch datasets transformers scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting comet_ml
  Downloading comet_ml-3.31.16-py3-none-any.whl (418 kB)
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting websocket-client<1.4.0,>=0.55.0
  Downloading websocket_client-1.3.3-py3-none-any.whl (54 kB)
Collecting dulwich!=0.20.33,>=0.20.6
  Downloading dulwich-0.20.50-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (498 kB)
Collecting wurlitzer>=1.0.2
  Downloading wurlitzer-3.0.2-py3-none-any.whl (7.3 kB)
Collecting requests-to

In [2]:
import comet_ml

comet_ml.init(project_name="imdb-distilbert")


Please enter your Comet API key from https://www.comet.com/api/my/settings/
(api key may not show as you type)
Comet API key: ··········


COMET INFO: Comet API key is valid
COMET INFO: Comet API key saved in /root/.comet.config


***Set Model Type***

In [3]:
Model_name = "distilbert-base-uncased"
SEED = 20

***Load Data***

In [4]:
from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

raw_datasets = load_dataset("imdb")


Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...



Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



***Setup Tokenizer***

In [5]:
tokenizer = AutoTokenizer.from_pretrained(Model_name)


In [6]:
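def tokenizer_function(examples):
  # Pad every review to the model's maximum length and truncate longer ones,
  # so all tokenized examples end up with the same length.
  return tokenizer(examples['text'], padding="max_length", truncation=True)

# batched=True passes the reviews to the tokenizer in batches rather than one at a time.
tokenized_datasets = raw_datasets.map(tokenizer_function, batched=True)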


In [7]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
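
The tokenizer call above pads with `padding="max_length"`, while `DataCollatorWithPadding` pads each batch only up to its longest sequence. The following is a minimal stdlib-only sketch of the difference between the two strategies. The helper names, the token ids, and `PAD_ID = 0` are assumptions made for illustration; this is not the library's implementation.

```python
PAD_ID = 0  # assumed pad token id for this sketch

def pad_to_max_length(batch, max_length, pad_id=PAD_ID):
    """Truncate or pad every sequence to the same fixed max_length."""
    padded = []
    for seq in batch:
        seq = seq[:max_length]  # truncation
        padded.append(seq + [pad_id] * (max_length - len(seq)))
    return padded

def pad_to_longest(batch, pad_id=PAD_ID):
    """Pad every sequence only up to the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Made-up token id sequences of different lengths.
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
fixed = pad_to_max_length(batch, max_length=8)
dynamic = pad_to_longest(batch)
print(len(fixed[0]), len(dynamic[0]))  # 8 5
```

When sequences are already padded to a fixed length, a per-batch padding step has nothing left to add, since every sequence in the batch already has the same length.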

***Create Sample Datasets***

In [8]:
# Draw a reproducible 1,000-review sample from each split.
train_datasets = tokenized_datasets['train'].shuffle(seed=SEED).select(range(1000))
eval_datasets = tokenized_datasets['test'].shuffle(seed=SEED).select(range(1000))
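
The sampling cell relies on a seeded shuffle followed by taking the first 1,000 rows, so the same subset comes back on every run. A rough stdlib-only sketch of that idea is below. `shuffled_subset` and the placeholder `reviews` list are hypothetical, and the sketch will not reproduce the exact rows that `datasets` selects, since the library uses its own random generator.

```python
import random

def shuffled_subset(items, seed, n):
    """Shuffle the indices of items with a fixed seed, then keep the first n."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    indices = list(range(len(items)))
    rng.shuffle(indices)
    return [items[i] for i in indices[:n]]

reviews = [f"review-{i}" for i in range(25000)]  # hypothetical placeholder data
first_run = shuffled_subset(reviews, seed=20, n=1000)
second_run = shuffled_subset(reviews, seed=20, n=1000)
print(len(first_run), first_run == second_run)  # 1000 True
```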

***Setup Transformer Model***

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    Model_name, 
    num_labels=2,
)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi
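
The warning above says the `pre_classifier` and classification head weights are newly initialized, and the training log later reports `Number of trainable parameters = 66955010`. The arithmetic below is a sketch of where that number can come from. The dimensions are assumed to be the standard `distilbert-base-uncased` configuration (they are not stated in this notebook), and the breakdown into layers is a simplification for counting only.

```python
# Assumed distilbert-base-uncased dimensions (not stated in the notebook).
VOCAB, MAX_POS, HIDDEN, FFN, LAYERS, NUM_LABELS = 30522, 512, 768, 3072, 6, 2

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # one scale vector and one shift vector

# Word embeddings, position embeddings, and the embedding layer norm.
embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm

per_layer = (
    4 * linear(HIDDEN, HIDDEN)                    # query, key, value, output projections
    + linear(HIDDEN, FFN) + linear(FFN, HIDDEN)   # feed-forward block
    + 2 * layer_norm                              # after attention and after feed-forward
)

# Newly initialized parts: pre_classifier and the 2-label classifier.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * per_layer + head
print(total)  # 66955010
```

Under these assumptions, only the 592,130 head parameters start from random values; the rest are loaded from the checkpoint.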

***Setup Evaluation Function***

In [10]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def get_examples(index):
  # Return the raw review text for a row of the evaluation sample.
  return eval_datasets[index]["text"]

def compute_metrics(pred):
  experiment = comet_ml.get_global_experiment()
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)  # predicted class = index of the largest logit
  precision, recall, f1, _ = precision_recall_fscore_support(
      labels, preds, average="macro"
  )
  acc = accuracy_score(labels, preds)

  # Log to Comet only when an experiment is active.
  if experiment:
    epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
    experiment.set_epoch(epoch)
    experiment.log_confusion_matrix(
        y_true=labels,
        y_predicted=preds,
        file_name=f"confusion-matrix-epoch-{epoch}.json",
        labels=["negative", "positive"],
        index_to_example_function=get_examples,
    )
    # Log the first 20 evaluation reviews with their true labels.
    for i in range(20):
      experiment.log_text(get_examples(i), metadata={"label": labels[i].item()})

  return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
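
To make the `average="macro"` setting concrete, here is a small stdlib-only sketch that turns logits into predictions with an argmax and then averages per-class precision, recall, and F1 with equal weight per class. The logits and labels are made up, `argmax` and `macro_scores` are hypothetical helpers, and the code is meant to mirror the idea behind the scikit-learn call rather than replace it.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def macro_scores(labels, preds, classes=(0, 1)):
    """Per-class precision/recall/F1, averaged with equal weight per class."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]  # made-up model outputs
labels = [0, 1, 1, 1]                                        # made-up true labels
preds = [argmax(row) for row in logits]                      # [0, 1, 0, 1]
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = macro_scores(labels, preds)
print(accuracy, round(precision, 4), round(recall, 4), round(f1, 4))
# 0.75 0.75 0.8333 0.7333
```

Because each class counts equally, macro averages can differ from accuracy when one class is predicted better than the other, as in this toy example.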

***Run Training***

In [11]:
%env COMET_MODE=ONLINE
%env COMET_LOG_ASSETS=TRUE

training_args = TrainingArguments(
    seed = SEED,
    output_dir = "./results",
    overwrite_output_dir = True,
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=25,
    save_strategy="steps",
    save_total_limit=10,
    save_steps=25,
    per_device_train_batch_size=8
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets,
    eval_dataset=eval_datasets,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()

env: COMET_MODE=ONLINE
env: COMET_LOG_ASSETS=TRUE


The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 125
  Number of trainable parameters = 66955010
COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Couldn't find a Git repository in '/content' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.com https://www.comet.com/nakenzy/imdb-distilbert/c030bcdcb8174811880e06c5e5945458

Automatic Comet.ml online loggin

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 25 | No log | 0.542381 | 0.827 | 0.825071 | 0.834758 | 0.824484 |
| 50 | No log | 0.419555 | 0.819 | 0.818279 | 0.82976 | 0.821858 |
| 75 | No log | 0.383717 | 0.845 | 0.84481 | 0.850276 | 0.846987 |
| 100 | No log | 0.373842 | 0.85 | 0.849654 | 0.857912 | 0.852409 |
| 125 | No log | 0.330066 | 0.87 | 0.869284 | 0.873165 | 0.868521 |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-25
Configuration saved in ./results/checkpoint-25/config.json
Model weights saved in ./results/checkpoint-25/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to ./results/checkpoint-50
Configuration saved in ./results/checkpoint-50/config.json
Model weights saved in ./results/ch

TrainOutput(global_step=125, training_loss=0.44688397216796877, metrics={'train_runtime': 164.5346, 'train_samples_per_second': 6.078, 'train_steps_per_second': 0.76, 'total_flos': 132467398656000.0, 'train_loss': 0.44688397216796877, 'epoch': 1.0})
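
The step counts and throughput figures in the log and in `TrainOutput` follow from the training arguments. The sketch below redoes that arithmetic with the values from this run, assuming a single device and no gradient accumulation, which is what the log reports. The variable names are illustrative only.

```python
import math

# Values taken from the cells and log output above.
num_examples, batch_size, num_epochs = 1000, 8, 1
eval_steps = save_steps = 25
save_total_limit = 10
train_runtime = 164.5346  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))
checkpoints_written = len(range(save_steps, total_steps + 1, save_steps))
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(total_steps)                                          # 125
print(eval_points)                                          # [25, 50, 75, 100, 125]
print(checkpoints_kept)                                     # 5
print(round(num_examples * num_epochs / train_runtime, 3))  # 6.078
print(round(total_steps / train_runtime, 2))                # 0.76
```

The `Training Loss` column probably shows `No log` because the run lasts only 125 steps, while the Trainer's default logging interval is 500 steps (assuming that default was left unchanged, since `logging_steps` is not set in the cell above).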