### 0. Install Libraries

In [1]:
!pip install -U transformers
!pip install sentencepiece

# Restart kernel after installation: Runtime -> Restart runtime


In [5]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import tensorflow as tf
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

### 1. Download and load dataset

In [1]:
# Download dataset
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-06-27 10:17:36--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-06-27 10:17:39 (28.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
!ls ac*/

imdbEr.txt  imdb.vocab	README	test  train


In [3]:
!ls ac*/train/

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [4]:
!cat ac*/train/pos/0_9.txt 

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

In [6]:
def read_imdb_split(split_dir):
    """Helper function to read text from txt files located in 
    `split_dir/pos/*.txt` or `split_dir/neg/*.txt`

    @param split_dir: path to train or test directory that contains both pos and neg subdirectory. 

    @returns texts: List of str where each element is a feature (text)
    @returns labels: List of int where each element is a label (positive:1, negative: 0)
    """

    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ['pos', 'neg']:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(1 if label_dir == 'pos' else 0)

    return texts, labels

train_texts, train_labels = read_imdb_split("aclImdb/train")
test_texts, test_labels = read_imdb_split("aclImdb/test")

In [7]:
print(f"Train size: {len(train_texts)} | Test size: {len(test_texts)}")
print(f"Train size: {len(train_labels)} | Test size: {len(test_labels)}")

Train size: 25000 | Test size: 25000
Train size: 25000 | Test size: 25000
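
The helper above depends on the `split_dir/pos` and `split_dir/neg` layout. As a rough sketch, the code below rebuilds that layout in a temporary directory with two made-up review files and reads it back. `read_split_sorted` is a hypothetical variant, not the notebook's function: it sorts the file names and passes an explicit encoding so the result is reproducible across runs and platforms.

```python
import tempfile
from pathlib import Path


def read_split_sorted(split_dir):
    """Hypothetical variant of read_imdb_split with a deterministic file order."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() fixes the order; iterdir() alone gives no ordering guarantee
        for text_file in sorted((split_dir / label_dir).glob("*.txt")):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(1 if label_dir == "pos" else 0)
    return texts, labels


def demo_read_split():
    """Build a tiny fake split on disk (made-up reviews) and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("pos", "neg"):
            (root / name).mkdir()
        (root / "pos" / "0_9.txt").write_text("Great film.", encoding="utf-8")
        (root / "neg" / "0_3.txt").write_text("Dull film.", encoding="utf-8")
        return read_split_sorted(root)


demo_texts, demo_labels = demo_read_split()
print(demo_texts, demo_labels)
```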


### 2. Split and tokenize dataset

In [8]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [9]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [10]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [11]:
print(train_encodings.keys())

# Each key maps to a list of lists (one inner list per review), e.g.:
# 'input_ids': [[1, 2, 3], [4, 5, 6]]
# tf.data.Dataset.from_tensor_slices (section 3) slices these lists into one element per review.

dict_keys(['input_ids', 'attention_mask'])
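
To make the two keys concrete, here is a minimal pure-Python sketch of what `truncation=True, padding=True` does to a batch, conceptually. It is not the tokenizer's implementation. `pad_batch` is a hypothetical helper, the token ids are made up apart from 101 and 102 (`[CLS]` and `[SEP]`), and the sketch assumes a pad id of 0 and a maximum length of 512, as in `distilbert-base-uncased`. The real tokenizer also keeps the closing `[SEP]` when it truncates, which this sketch ignores.

```python
PAD_ID = 0        # assumption: id of [PAD] in the distilbert-base-uncased vocabulary
MAX_LENGTH = 512  # assumption: maximum sequence length of the model


def pad_batch(batch_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate each sequence, then pad all of them to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
toy_encodings = pad_batch(toy_batch)
print(toy_encodings["input_ids"])
print(toy_encodings["attention_mask"])
```

Every inner list ends up with the same length, which is what allows the encodings to be converted into rectangular tensors later.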


### 3. Create tf Dataset

In [12]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels)) 
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))        
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))    
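
Each element of these datasets is a `(features, label)` pair, where `features` is a dict with one row of `input_ids` and one row of `attention_mask`. The sketch below imitates that element structure in plain Python so it can be inspected without TensorFlow. `slice_examples` and the toy values are hypothetical, and the real dataset yields tensors rather than lists.

```python
def slice_examples(encodings, labels):
    """Yield one (features, label) pair per example, mimicking the dataset's element structure."""
    keys = list(encodings)
    for i, label in enumerate(labels):
        yield {key: encodings[key][i] for key in keys}, label


toy_features = {
    "input_ids": [[101, 2023, 102, 0], [101, 2023, 3185, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [1, 0]

toy_elements = list(slice_examples(toy_features, toy_labels))
for features, label in toy_elements:
    print(features, label)
```

On the real objects, `train_dataset.element_spec` reports the same structure with shapes and dtypes.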

### 4. Prepare training arguments and start training

In [14]:
training_args = TFTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    #logging_steps=10,
    save_strategy="epoch",
    logging_strategy="epoch",
)

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")


trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_projector', 'vocab_transform', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



**Training completed. Do not forget to share your model on huggingface.co/models =)**

```
TrainOutput(global_step=7500, training_loss=0.20122720197954524, metrics={'train_runtime': 3150.3136, 'train_samples_per_second': 19.046, 'train_steps_per_second': 2.381, 'total_flos': 1.23411474432e+16, 'train_loss': 0.20122720197954524, 'epoch': 3.0})
```
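
The numbers in `TrainOutput` can be checked against the settings used earlier. The following is a small sanity check, assuming training ran on a single device so that the per-device batch size of 8 is also the total batch size; the variable names are only illustrative.

```python
import math

n_total = 25000             # reviews in aclImdb/train
n_val = 5000                # test_size=.2 of 25000 held out for validation
n_train = n_total - n_val   # 20000 reviews left for training
batch_size = 8              # per_device_train_batch_size (single device assumed)
epochs = 3                  # num_train_epochs
train_runtime = 3150.3136   # seconds, copied from the TrainOutput metrics

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = round(n_train * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(total_steps, samples_per_second, steps_per_second)
```

Under that assumption the three values reproduce `global_step`, `train_samples_per_second`, and `train_steps_per_second` from the output.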

### 5. Predict on test set

In [15]:
preds = trainer.predict(test_dataset)



In [16]:
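from sklearn.metrics import classification_report

# trainer.predict() returns a named tuple: (predictions, label_ids, metrics)
predictions = preds.predictions.argmax(-1)  # logits -> predicted class ids

print(classification_report(preds.label_ids,  # true labels from test_dataset
                            predictions))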

              precision    recall  f1-score   support

           0       0.93      0.93      0.93     12500
           1       0.93      0.93      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000
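
For readers unfamiliar with the columns in the report, here is a minimal sketch of how precision, recall, and F1 are derived for one class. `binary_report` is a hypothetical helper and the six labels are toy data, not the notebook's predictions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
toy_precision, toy_recall, toy_f1 = binary_report(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

`support` in the report is simply the number of true examples of each class, 12500 per class here.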



### 6. Save model

In [17]:
trainer.save_model("./models")

In [18]:
!ls models/

#config.json tf_model.h5

config.json  tf_model.h5


### 7. Inference using pipeline

In [19]:
from transformers import pipeline

review_pipeline = pipeline("text-classification", model="./models", tokenizer=tokenizer, return_all_scores=True)

Some layers from the model checkpoint at ./models were not used when initializing TFDistilBertForSequenceClassification: ['dropout_39']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./models and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
positive_test_case = "Awesome movie. Love it so much!"

predictions = review_pipeline(positive_test_case)

predictions

#[[{'label': 'LABEL_0', 'score': 0.0015208885306492448},
#  {'label': 'LABEL_1', 'score': 0.9984791278839111}]]

[[{'label': 'LABEL_0', 'score': 0.0015208885306492448},
  {'label': 'LABEL_1', 'score': 0.9984791278839111}]]

In [21]:
negative_test_case = "Bad movie and storyline. I hate it so much!"

predictions = review_pipeline(negative_test_case)

predictions
#[[{'label': 'LABEL_0', 'score': 0.9911043643951416},
# {'label': 'LABEL_1', 'score': 0.008895594626665115}]]

[[{'label': 'LABEL_0', 'score': 0.9911043643951416},
  {'label': 'LABEL_1', 'score': 0.008895594626665115}]]
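
The pipeline returns one list of `{'label', 'score'}` dicts per input. The short sketch below turns that structure into a readable verdict, using the label convention from `read_imdb_split` (0 is negative, 1 is positive). `LABEL_NAMES` and `top_label` are hypothetical names, and the scores are copied from the output above.

```python
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def top_label(all_scores):
    """Return the readable name and score of the highest-scoring label."""
    best = max(all_scores, key=lambda item: item["score"])
    return LABEL_NAMES[best["label"]], best["score"]


# Same structure as the pipeline output above: one inner list per input text
pipeline_output = [[{"label": "LABEL_0", "score": 0.9911043643951416},
                    {"label": "LABEL_1", "score": 0.008895594626665115}]]

for scores in pipeline_output:
    name, score = top_label(scores)
    print(f"{name} ({score:.4f})")
```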

### 8. Inference using model.from_pretrained()

In [23]:
reload_model = TFDistilBertForSequenceClassification.from_pretrained("./models")

Some layers from the model checkpoint at ./models were not used when initializing TFDistilBertForSequenceClassification: ['dropout_39']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./models and are newly initialized: ['dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
test_cases = [positive_test_case, negative_test_case]
encoded_test_cases = tokenizer(test_cases, truncation=True, padding=True, return_tensors='tf')

outputs = reload_model(encoded_test_cases)

In [31]:
predictions = tf.argmax(outputs.logits, axis=-1)

In [32]:
for sent, pred in zip(test_cases, predictions):
    print(f"Sentence: {sent} | Predicted: {'Positive' if pred == 1 else 'Negative'}")

Sentence: Awesome movie. Love it so much! | Predicted: Positive
Sentence: Bad movie and storyline. I hate it so much! | Predicted: Negative
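
`tf.argmax` only returns the winning class. To get probabilities comparable to the pipeline's scores, a softmax can be applied to the logits. This is a minimal pure-Python sketch: the logits are hypothetical values, not the model's actual output, and on the real tensors `tf.nn.softmax(outputs.logits, axis=-1)` would do the same job.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-2.0, 3.0]  # hypothetical logits for [negative, positive]
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Softmax preserves the ordering of the logits, so the arg max of the probabilities is the same class that `tf.argmax` picks from the raw logits.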


### References


1.   Tokenizer: https://huggingface.co/transformers/main_classes/tokenizer.html
2.   Pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#the-pipeline-abstraction
3.   Load model after trainer.train(): https://discuss.huggingface.co/t/how-to-test-my-text-classification-model-after-training-it/6689/2
4.   Fine-tuning with custom datasets: https://huggingface.co/transformers/master/custom_datasets.html
5.   trainer.predict(): https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer.predict
