In [1]:
from dermclass_models2.preprocessing import TextPreprocessors
from dermclass_models2.config import TextConfig

from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

import tensorflow as tf

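# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front.
# (A ConfigProto only takes effect when it is passed to a tf.compat.v1.Session.)
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)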

In [2]:
# preprocessing
pp = TextPreprocessors(TextConfig)
df = pp._load_class_from_dir(TextConfig.DATA_PATH / "lichen_planus")

df['encoded_cat'] = df['target'].astype('category').cat.codes

data_texts = df["text"].to_list() # Features (not yet tokenized)
data_labels = df["encoded_cat"].to_list() # Labels

# Split Train and Validation data
train_texts, val_texts, train_labels, val_labels = train_test_split(data_texts, data_labels, test_size=0.2, random_state=0)

# Keep some data for inference (testing)
train_texts, test_texts, train_labels, test_labels = train_test_split(train_texts, train_labels, test_size=0.01, random_state=0)

2020-12-06 13:59:22,405 — dermclass_models2.preprocessing — INFO —_load_class_from_dir:264 — Successfully loaded class lichen_planus
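
The two chained `train_test_split` calls above leave very few samples for testing when the corpus is small. The sketch below reproduces the size arithmetic only, assuming scikit-learn's rounding for a float `test_size` (test share rounded up, remainder used for training). The helper `split_sizes` and the sample count of 22 are illustrative assumptions; the actual size of `df` is not shown in this notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size.

    Mirrors scikit-learn's rounding: the test share is rounded up and
    whatever remains is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 22  # assumed corpus size, for illustration only

# First split: train vs. validation (test_size=0.2)
n_train, n_val = split_sizes(n_total, 0.2)

# Second split: carve a small test set out of the training part (test_size=0.01)
n_train, n_test = split_sizes(n_train, 0.01)

print(f"train={n_train}, validation={n_val}, test={n_test}")
```

With a corpus this small, `test_size=0.01` still yields a single test sample because of the rounding up, so any metric computed on it would be very noisy.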


In [3]:
# Pipeline
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

In [4]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.batch(16), epochs=2)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_transform', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x20a07d59048>

In [1]:
from dermclass_models2.pipeline import TextPipeline
from dermclass_models2.preprocessing import TextPreprocessors
tp = TextPreprocessors()
tm = TextPipeline()
train_dataset_dermclass, validation_dataset_dermclass, test_dataset_dermclass = tp.load_data(get_datasets=True)
tm.fit_datasets(train_dataset_dermclass, validation_dataset_dermclass, test_dataset_dermclass)

Found 22 files belonging to 3 classes.
Using 18 files for training.
Found 22 files belonging to 3 classes.
Using 4 files for validation.
2020-12-06 14:12:01,584 — dermclass_models2.preprocessing — INFO —_load_dataset:301 — Successfully loaded train and validation datasets 
2020-12-06 14:12:01,590 — dermclass_models2.preprocessing — INFO —_split_train_test_tf:120 — Successfully prefetched train, test and validation datasets
2020-12-06 14:12:01,592 — dermclass_models2.preprocessing — INFO —_split_train_test_tf:123 — Number of train batches: 6        Number of validation batches: 1        Number of test batches: 1


In [2]:
transformers_modeling_pipeline = tm.get_modeling_pipeline(use_sklearn=False)

2020-12-06 14:12:02,521 — dermclass_models2.pipeline — INFO —get_processing_pipeline:442 — Successfully loaded processing pipeline


2020-12-06 14:12:04,228 — dermclass_models2.pipeline — INFO —_compile_model:206 — Successfully compiled tf model
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
2020-12-06 14:12:26,276 — dermclass_models2.pipeline — INFO —_train_model:230 — Successfully trained tf model
2020-12-06 14:12:26,277 — dermclass_models2.pipeline — INFO —get_modeling_pipeline:469 — Successfully loaded modeling pipeline


In [10]:
test_dataset_dermclass_encoded = transformers_modeling_pipeline.processing_pipeline(test_dataset_dermclass, transformers_modeling_pipeline.tokenizer)

In [17]:
results = transformers_modeling_pipeline.model.evaluate(test_dataset_dermclass_encoded.batch(3))



In [18]:
results

[1.0754534006118774, 0.3333333432674408]
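
To put these numbers in context, it may help to compare them with a chance-level baseline. The sketch below assumes that `results` is `[loss, accuracy]` with a cross-entropy loss, which the notebook does not confirm (the metrics are set inside `_compile_model`, not shown here). The helper `chance_level` is hypothetical.

```python
import math

def chance_level(num_classes):
    """Loss and accuracy of a classifier that predicts a uniform distribution."""
    loss = math.log(num_classes)   # cross-entropy of a uniform prediction: -ln(1/K)
    accuracy = 1.0 / num_classes   # expected accuracy of uniform random guessing
    return loss, accuracy

# 3 classes, matching num_labels=3 and "belonging to 3 classes" in the logs
baseline_loss, baseline_acc = chance_level(3)

# Values copied from the `results` output above (assumed order: loss, accuracy)
reported_loss, reported_acc = 1.0754534006118774, 0.3333333432674408

print(f"chance:   loss={baseline_loss:.4f}, accuracy={baseline_acc:.4f}")
print(f"reported: loss={reported_loss:.4f}, accuracy={reported_acc:.4f}")
```

Under that assumption, the reported loss and accuracy sit close to the chance baseline, which would be consistent with a model that has learned little from 18 training files. The test set is a single batch, so the comparison is only indicative.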

In [6]:
processing_pipeline = tm.get_processing_pipeline(False)

2020-12-06 13:59:37,562 — dermclass_models2.pipeline — INFO —get_processing_pipeline:433 — Successfully loaded processing pipeline


In [7]:
train_dataset_dermclass_encoded, validation_dataset_dermclass_encoded = (processing_pipeline(train_dataset_dermclass, tm.tokenizer),
                                                                         processing_pipeline(validation_dataset_dermclass, tm.tokenizer))

In [8]:
model.fit(train_dataset.batch(3), epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x20a0fa77588>

In [9]:
model.fit(train_dataset_dermclass_encoded.batch(3), epochs=2, validation_data=validation_dataset_dermclass_encoded.batch(3))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x20a7fbc6a88>

In [10]:
transformers_modeling_pipeline = tm.get_modeling_pipeline(use_sklearn=False)

2020-12-06 13:59:45,872 — dermclass_models2.pipeline — INFO —get_processing_pipeline:433 — Successfully loaded processing pipeline


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_transform', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use i

2020-12-06 13:59:47,405 — dermclass_models2.pipeline — INFO —_compile_model:203 — Successfully compiled tf model
Epoch 1/5


ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[3,512,12,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/transpose_3 (defined at C:\Users\Kajetan\Anaconda3\envs\dermclass_models2\lib\site-packages\transformers\models\distilbert\modeling_tf_distilbert.py:228) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[gradient_tape/tf_distil_bert_for_sequence_classification_1/distilbert/embeddings/position_embeddings/embedding_lookup/Reshape/_284]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[3,512,12,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/transpose_3 (defined at C:\Users\Kajetan\Anaconda3\envs\dermclass_models2\lib\site-packages\transformers\models\distilbert\modeling_tf_distilbert.py:228) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_29144]

Errors may have originated from an input operation.
Input Source operations connected to node tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/transpose_3:
 tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/MatMul_1 (defined at C:\Users\Kajetan\Anaconda3\envs\dermclass_models2\lib\site-packages\transformers\models\distilbert\modeling_tf_distilbert.py:249)

Input Source operations connected to node tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/transpose_3:
 tf_distil_bert_for_sequence_classification_1/distilbert/transformer/layer_._5/attention/MatMul_1 (defined at C:\Users\Kajetan\Anaconda3\envs\dermclass_models2\lib\site-packages\transformers\models\distilbert\modeling_tf_distilbert.py:249)

Function call stack:
train_function -> train_function
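
The shape in the error, `shape[3,512,12,64]`, can be read as batch size 3, sequence length 512, 12 attention heads and 64 dimensions per head. The sketch below estimates how large such tensors are in float32 and how the attention-score tensor shrinks with a shorter sequence. The helper `tensor_bytes` and the alternative length of 128 are assumptions for illustration, not values taken from the pipeline.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; float32 uses 4 bytes per element."""
    return prod(shape) * bytes_per_element

batch, heads, head_dim = 3, 12, 64  # read off shape[3,512,12,64] in the error

for seq_len in (512, 128):  # 512 from the error; 128 is a hypothetical shorter cap
    context = tensor_bytes((batch, seq_len, heads, head_dim))  # tensor named in the error
    scores = tensor_bytes((batch, heads, seq_len, seq_len))    # attention weights
    print(f"seq_len={seq_len}: context={context / 2**20:.2f} MiB, "
          f"scores={scores / 2**20:.2f} MiB per layer")
```

The tensor that failed to allocate is only a few MiB, so the GPU was probably almost full already. One possible reading is that the model trained in the earlier cells was still resident in memory when the pipeline built a second one (the node name carries a `_1` suffix). Since the attention scores grow quadratically with sequence length, passing a smaller `max_length` to the tokenizer, or restarting the kernel before calling `get_modeling_pipeline`, may be enough to avoid the error.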
