In [1]:
from datasets import load_dataset
# imdb = load_dataset("imdb")


In [2]:
data_files = {"train": "data/imdb_train.csv", "test": "data/imdb_val.csv"}
imdb = load_dataset("csv", data_files=data_files)
imdb

Downloading and preparing dataset csv/default to /Users/emiliagenadieva/.cache/huggingface/datasets/csv/default-5a80cf620f54de00/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /Users/emiliagenadieva/.cache/huggingface/datasets/csv/default-5a80cf620f54de00/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['texts', 'labels'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['texts', 'labels'],
        num_rows: 25000
    })
})

In [3]:
# Shuffle with a fixed seed, then keep a small subset to shorten training and evaluation.
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))


In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [6]:
def preprocess_function(examples):
   return tokenizer(examples["texts"], truncation=True)
 
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)


In [7]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [8]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier

In [9]:
import numpy as np
import evaluate
 
# Load each metric once instead of on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    # eval_pred unpacks to (logits, labels).
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
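As a sanity check on what `compute_metrics` reports, the two numbers can be reproduced by hand with NumPy. This is a minimal sketch, not the `evaluate` implementation: it assumes binary labels with class 1 as the positive class, and the helper name `accuracy_and_f1` and the toy arrays are hypothetical.

```python
import numpy as np

def accuracy_and_f1(logits, labels, positive_label=1):
    """Binary accuracy and F1 computed by hand from raw logits."""
    predictions = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float(np.mean(predictions == labels))
    # Counts for the positive class (assumed to be label 1).
    tp = int(np.sum((predictions == positive_label) & (labels == positive_label)))
    fp = int(np.sum((predictions == positive_label) & (labels != positive_label)))
    fn = int(np.sum((predictions != positive_label) & (labels == positive_label)))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy data for illustration only: four examples, two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]])
toy_labels = np.array([0, 1, 1, 1])
toy_metrics = accuracy_and_f1(toy_logits, toy_labels)
print(toy_metrics)
```

On the toy data, three of four predictions match, and the positive class has two true positives and one false negative.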


In [10]:
from transformers import TrainingArguments, Trainer
 
repo_name = "finetuning-sentiment-model-3000-samples"
 
training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)
 
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)


In [11]:
trainer.train()



You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'train_runtime': 3522.455, 'train_samples_per_second': 1.703, 'train_steps_per_second': 0.107, 'train_loss': 0.305895744486058, 'epoch': 2.0}


TrainOutput(global_step=376, training_loss=0.305895744486058, metrics={'train_runtime': 3522.455, 'train_samples_per_second': 1.703, 'train_steps_per_second': 0.107, 'train_loss': 0.305895744486058, 'epoch': 2.0})

In [12]:
trainer.evaluate()

{'eval_loss': 0.25247812271118164,
 'eval_accuracy': 0.9,
 'eval_f1': 0.9050632911392406,
 'eval_runtime': 36.1795,
 'eval_samples_per_second': 8.292,
 'eval_steps_per_second': 0.525,
 'epoch': 2.0}
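The step counts and the accuracy above can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 16; the helper name `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # The last batch may be smaller, so round up.
    return math.ceil(num_examples / batch_size)

# 3000 training examples, batch size 16, 2 epochs.
train_steps = steps_per_epoch(3000, 16) * 2
# 300 test examples, batch size 16.
eval_steps = steps_per_epoch(300, 16)
# An accuracy of 0.9 on 300 examples.
correct = round(0.9 * 300)

print(train_steps, eval_steps, correct)
```

Under these assumptions the training total agrees with `global_step=376`, and an accuracy of 0.9 corresponds to 270 of the 300 test reviews being classified correctly.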

In [14]:
#text = "A successful artist looks back with loving memories on the summer of his defining year, 1974. A talented but troubled 18-year-old aspiring artist befriends a brilliant elderly alcoholic painter who has turned his back on not only art but life. The two form what appears to be at first a tenuous relationship. The kid wants to learn all the secrets the master has locked away inside his head and heart. Time has not been kind to the old master. His life appears pointless to him until the kid rekindles his interest in his work and ultimately gives him the will to live. Together, they give one another a priceless gift. The kid learns to see the world through the master's eyes. And the master learns to see life through the eyes of innocence again. This story is based on a real life experience."

text = 'The movie was very enjoyable. Costner is perfect as the aging macho man. Predictable in some parts sure, but never boring. All of the other military branches have had love notes written about them and seen their recruitment levels go up, why not the Coast Guard too? They are definitely under-appreciatedI would suggest this movie for anyone to see.And she replies, "No, you do not know where anything is in this house; I should be the one to go." This does not make sense: If she knows the layout so well, Costner is right, he *should* be the one to leave. The ending broke my heart but I know why he did it. The storyline was great I give it 2 thumbs up. I cried it was very emotional, I would give it a 20 if I could! My boyfriend and I went to watch The Guardian. At first I did not want to watch it, but I loved the movie- It was definitely the best movie I have seen in sometime. They portrayed the USCG very well, it really showed me what they do and I think they should really be appreciated more.Not only did it teach but it was a really good movie. The movie shows what the really do and how hard the job is. I think being a USCG would be challenging and very scary.'
#text = 'Toward the end of World War II, middle-aged soldier Keita is entrusted with a postcard from a comrade who is sure he will die in battle. After the war ends, Keita visits his comrade wife Yuko and bears witness to the tragic life she has led. This year Oscar entry from Japan finds SHINDO in top form and his 49th and reportedly last film as fresh and poignant as ever.'
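import torch

# Tokenize the review; truncation guards against inputs longer than the model's maximum length.
encoding = tokenizer(text, return_tensors="pt", truncation=True)
encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}

# Inference only: disable dropout and skip gradient tracking.
trainer.model.eval()
with torch.no_grad():
    outputs = trainer.model(**encoding)

logits = outputs.logits
# Softmax turns the two logits into class probabilities.
probs = logits.softmax(dim=-1).cpu().flatten().tolist()
print(probs)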

[0.01887666992843151, 0.9811232686042786]


In [16]:
model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}

In [15]:
# Index of the most probable class, mapped to its label name.
predicted_id = int(np.argmax(probs))
print("Label", model.config.id2label[predicted_id])

Label LABEL_1
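The default names `LABEL_0` and `LABEL_1` say nothing about sentiment. A minimal sketch of a more readable mapping follows; it assumes that label 0 means negative and label 1 means positive in the CSV files, and the names `readable_id2label` and `describe_prediction` are hypothetical.

```python
import numpy as np

# Assumption: 0 = negative review, 1 = positive review.
readable_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def describe_prediction(probabilities, id2label):
    """Return the most probable label name and its probability."""
    idx = int(np.argmax(probabilities))
    return id2label[idx], float(probabilities[idx])

# Probabilities printed by the inference cell above.
example_probs = [0.01887666992843151, 0.9811232686042786]
label, confidence = describe_prediction(example_probs, readable_id2label)
print(f"{label} ({confidence:.3f})")
```

A mapping like this could also be supplied when the model is created, so that `model.config.id2label` returns the readable names directly.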


### Conclusion
The model assigns a probability of about 0.98 to `LABEL_1`, so it correctly classifies the review as positive.