[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dhupee/Bangkit-C22CB-Company-Based-Capstone/blob/30b0995970f29114749cff04deef444de6832993/ML/distilbert_transfer_learn.ipynb)

In [2]:
# check python version
import sys
print(sys.version)

3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]


In [3]:
# notebook settings

COLAB_MODE = False # set to True if running in Google Colab
ENABLE_JSON2CSV = False # set to True if you want to convert json dataset to csv

In [4]:
# if COLAB_MODE is True, clone the repository and work from its ML directory
if COLAB_MODE:
    import os
    branch_name = 'dhupee-dev'
    cloned_repo_name = 'remote-clone'
    target_repo_dir = '/content/remote-clone/ML'
    repo_link = 'https://github.com/dhupee/Bangkit-C22CB-Company-Based-Capstone.git'
    # clone the repo only if it is not already present
    if not os.path.exists(target_repo_dir):
        !git clone --single-branch --branch $branch_name $repo_link $cloned_repo_name
        print('Repo successfully cloned!')
    else:
        print('Repo already cloned')
    # always switch to the ML directory so the relative dataset paths resolve
    %cd $target_repo_dir
    %pwd

In [5]:
# check whether transformers and tensorflow are installed; install them if not
# this notebook uses transformers 4.18.0 and tensorflow 2.8.0
try:
    import transformers
    import tensorflow as tf
    print("transformers and tensorflow are installed")
except ImportError:  # catch only missing packages, not every possible error
    print("transformers and tensorflow are not installed")
    print("installing transformers and tensorflow")
    # install transformers 4.18.0 and tensorflow 2.8.0
    %pip install transformers==4.18.0
    %pip install tensorflow==2.8.0
    # import transformers and tensorflow again
    import transformers
    import tensorflow as tf

transformers and tensorflow are installed


In [6]:
model_name = "cahya/bert-base-indonesian-522M"
batch_size = 32

from transformers import BertTokenizer, TFAutoModel # use the TensorFlow (TF*) model classes
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

Some layers from the model checkpoint at cahya/bert-base-indonesian-522M were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [7]:
model.summary()

Model: "tf_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  110617344 
                                                                 
Total params: 110,617,344
Trainable params: 110,617,344
Non-trainable params: 0
_________________________________________________________________


In [8]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizer)

In [9]:
# test tokenizer
tokenizer("Nama kamu siapa?")

{'input_ids': [3, 1769, 8343, 6186, 32, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [10]:
tokenizer("saya suka makan nasi goreng")

{'input_ids': [3, 3245, 5366, 2464, 6014, 11186, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
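Question answering feeds the model a question and a context as one sequence, so it helps to see how the three fields above are laid out for a pair. The following is a minimal pure-Python sketch, not the tokenizer's implementation. The ids 3 and 1 are read off the outputs above (first and last position, i.e. `[CLS]` and `[SEP]`); `PAD_ID = 0` and the helper name `encode_pair` are assumptions for illustration.

```python
# Special-token ids: 3 and 1 are taken from the outputs above ([CLS] first, [SEP] last).
# PAD_ID = 0 is an assumption for illustration; check tokenizer.pad_token_id.
CLS_ID, SEP_ID, PAD_ID = 3, 1, 0

def encode_pair(question_ids, context_ids, max_length):
    """Lay out [CLS] question [SEP] context [SEP] and pad to max_length."""
    input_ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length; truncate or split the context")
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    attention_mask = [1] * len(input_ids)  # 1 = real token, 0 = padding
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

# Token ids borrowed from the two tokenizer outputs above.
encoded = encode_pair([1769, 8343, 6186, 32], [3245, 5366, 2464], max_length=12)
print(encoded)
```

In practice the same layout comes from calling the tokenizer with two arguments, `tokenizer(question, context)`.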

In [11]:
unmasker = transformers.pipeline('fill-mask', model = model_name)
unmasker("mainan saya [MASK] di jalan")

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


[{'score': 0.0840364545583725,
  'token': 2186,
  'token_str': 'berada',
  'sequence': 'mainan saya berada di jalan'},
 {'score': 0.07038316130638123,
  'token': 1821,
  'token_str': 'ada',
  'sequence': 'mainan saya ada di jalan'},
 {'score': 0.0403575636446476,
  'token': 1998,
  'token_str': 'sendiri',
  'sequence': 'mainan saya sendiri di jalan'},
 {'score': 0.029048316180706024,
  'token': 2444,
  'token_str': 'lahir',
  'sequence': 'mainan saya lahir di jalan'},
 {'score': 0.028137197718024254,
  'token': 3812,
  'token_str': 'berdiri',
  'sequence': 'mainan saya berdiri di jalan'}]

In [12]:
# load dataset json file
import json

train_json_dir = "Translated/train-v2.0_indo.json"
dev_json_dir = "Translated/dev-v2.0_indo.json"
tester_json_dir  = "Translated/tester_indo.json"

dataset_dirs = [train_json_dir, dev_json_dir, tester_json_dir]
# dataset_dirs = [tester_json_dir]

In [13]:
'''
    Convert each SQuAD json file to a pandas dataframe and save it as csv.

    Intended to be run in Colab rather than locally.
'''

if ENABLE_JSON2CSV:
    import utils  # helper module from this repository
    for json_path in dataset_dirs:  # avoid shadowing the built-in dir()
        with open(json_path, encoding="utf-8") as json_file:
            data = json.load(json_file)['data']

        df = utils.json_to_df(data)
        df.to_csv(json_path.replace(".json", ".csv"), index = False)
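The helper `utils.json_to_df` is not shown in this notebook. As a rough idea of what such a conversion involves, here is a hypothetical sketch (`squad_to_rows` is an illustrative name, not the repository's function) that flattens the nested SQuAD v2.0 structure into one row per question/answer pair, using the column names that appear in the csv files loaded further down.

```python
def squad_to_rows(data):
    """Flatten the SQuAD 'data' list into one dict per (question, answer) pair."""
    rows = []
    for article in data:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable SQuAD v2.0 questions have an empty answers list.
                answers = qa["answers"] or [{"text": None, "answer_start": None}]
                for answer in answers:
                    rows.append({
                        "input_id": qa["id"],
                        "title": article["title"],
                        "context": paragraph["context"],
                        "question": qa["question"],
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                    })
    return rows

# Tiny made-up sample in SQuAD v2.0 layout.
sample = [{
    "title": "contoh",
    "paragraphs": [{
        "context": "Jakarta adalah ibu kota Indonesia.",
        "qas": [
            {"id": "q1", "question": "Apa ibu kota Indonesia?",
             "answers": [{"text": "Jakarta", "answer_start": 0}]},
            {"id": "q2", "question": "Siapa presidennya?", "answers": []},
        ],
    }],
}]
rows = squad_to_rows(sample)
print(rows)
```

Emitting one row per answer means a question with several reference answers appears several times, which may be why some rows in the csv preview further down look repeated.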

---

In [14]:
max_length = 384  # The maximum length of a feature (question and context), in tokens.
doc_stride = 128  # The allowed overlap between two parts of the context when splitting is performed.
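To make the role of the overlap concrete, here is a small pure-Python sketch of splitting a long context into overlapping windows. It is an illustration under simplifying assumptions, not the tokenizer's implementation: `split_into_windows` is a hypothetical helper, and `window_size` stands for the token budget left for the context after the question and special tokens are subtracted from `max_length`.

```python
def split_into_windows(context_tokens, window_size, doc_stride):
    """Split tokens into windows of at most window_size, overlapping by doc_stride."""
    if not 0 <= doc_stride < window_size:
        raise ValueError("doc_stride must be non-negative and smaller than window_size")
    step = window_size - doc_stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows

# Small numbers so the overlap is easy to see.
windows = split_into_windows(list(range(10)), window_size=4, doc_stride=2)
print(windows)
```

Consecutive windows share `doc_stride` tokens, so an answer that straddles a window boundary still appears whole in at least one window, provided it is not longer than the overlap.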

In [15]:
try:
    import pandas as pd
except ImportError:  # catch only a missing package
    print("pandas is not installed")
    print("installing pandas...")
    %pip install pandas
    import pandas as pd

train_csv_dir = "Translated/train-v2.0_indo.csv"
dev_csv_dir = "Translated/dev-v2.0_indo.csv"
tester_csv_dir  = "Translated/tester_indo.csv"

# open one of the csv file
df_tester = pd.read_csv(tester_csv_dir)
df_tester.head()

,input_id,title,context,question,answer_text,answer_start
0,56ddde6b9a695914005b9628,orang Normandia,"Gaya pound memiliki padanan metrik, lebih jara...",Di negara manakah Normandia berada?,Perancis,159.0
1,56ddde6b9a695914005b9628,orang Normandia,"Gaya pound memiliki padanan metrik, lebih jara...",Di negara manakah Normandia berada?,Perancis,159.0
2,56ddde6b9a695914005b9628,orang Normandia,"Gaya pound memiliki padanan metrik, lebih jara...",Di negara manakah Normandia berada?,Perancis,159.0
3,56ddde6b9a695914005b9628,orang Normandia,"Gaya pound memiliki padanan metrik, lebih jara...",Di negara manakah Normandia berada?,Perancis,159.0
4,56ddde6b9a695914005b9629,orang Normandia,"Gaya pound memiliki padanan metrik, lebih jara...",Kapan orang Normandia di Normandia?,abad 10 dan 11,94.0


In [60]:
#check dataframe datatype
df_tester.dtypes

input_id         object
title            object
context          object
question         object
answer_text      object
answer_start    float64
dtype: object
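`answer_start` is stored as `float64`, which in pandas usually means the column contains missing values (likely the unanswerable SQuAD v2.0 questions). Before training it is worth checking that each offset really points at the answer inside the context, since character offsets can drift when a dataset is translated. The helper below is a hypothetical sketch of that check on plain Python values.

```python
import math

def answer_span(context, answer_text, answer_start):
    """Return (start, end) character offsets, or None if the row has no usable answer."""
    if answer_start is None or (isinstance(answer_start, float) and math.isnan(answer_start)):
        return None  # no answer recorded for this row
    start = int(answer_start)
    end = start + len(answer_text)
    if context[start:end] != answer_text:
        return None  # the offset does not point at the answer text
    return start, end

# Made-up example row; the offset is computed rather than hard-coded.
context = "Normandia adalah wilayah di Perancis."
start = float(context.index("Perancis"))
print(answer_span(context, "Perancis", start))          # aligned offset
print(answer_span(context, "Perancis", float("nan")))   # missing answer
print(answer_span(context, "Perancis", 0.0))            # misaligned offset
```

Rows for which this returns `None` either have no answer or need their offset recomputed, for example with `context.find(answer_text)`.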

In [33]:
csv_dirs = [train_csv_dir, dev_csv_dir, tester_csv_dir]

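# check every csv file and count the features (question + context) longer than max_length tokens
# row[['context', 'question', 'input_id']] selects three columns, so len() of it is always 3;
# count the tokens of the encoded question/context pair instead
for csv_dir in csv_dirs:
    df = pd.read_csv(csv_dir)
    longer_feature = 0
    for index, row in df.iterrows():
        # encode question and context as a pair, without truncation
        encoded = tokenizer(str(row['question']), str(row['context']))
        if len(encoded['input_ids']) > max_length:
            longer_feature += 1
    if longer_feature == 0:
        print(f"{csv_dir} has no features longer than {max_length}")
    else:
        print(f"{csv_dir} has {longer_feature} features longer than {max_length}")
    print("\n")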

Translated/train-v2.0_indo.csv has no features longer than 384


Translated/dev-v2.0_indo.csv has no features longer than 384


Translated/tester_indo.csv has no features longer than 384




In [63]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at cahya/bert-base-indonesian-522M and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [83]:
learning_rate = 0.00001
num_train_epochs = 2
weight_decay = 0.01
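With these settings, the number of optimizer steps follows from the dataset size, `batch_size` and `num_train_epochs`. A small sketch of the arithmetic, assuming the final partial batch is kept; the dataset size of 1000 is a made-up figure, not the size of the real training set.

```python
import math

def count_training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# batch_size = 32 and 2 epochs as in this notebook; 1000 examples is hypothetical.
steps_per_epoch, total_steps = count_training_steps(1000, batch_size=32, num_epochs=2)
print(steps_per_epoch, total_steps)
```

For a batched `tf.data.Dataset` built from in-memory data, `len(dataset)` reports the same per-epoch figure.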

In [84]:
# TODO: load preprocessed dataset
# TODO: train model
# TODO: evaluate model
# TODO: save model

In [87]:

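# load preprocessed dataset
train_csv_dir = "Translated/train-v2.0_indo.csv"
dev_csv_dir = "Translated/dev-v2.0_indo.csv"

df_train = pd.read_csv(train_csv_dir)
df_dev = pd.read_csv(dev_csv_dir)

# offset mappings (token -> character span) are only available from the fast tokenizer
from transformers import BertTokenizerFast
fast_tokenizer = BertTokenizerFast.from_pretrained(model_name)

def get_dataset(df, batch_size):
    """
    Convert a pandas dataframe to a tf.data.Dataset of (inputs, (start, end)) batches.

    start/end are answer token indices; rows without an answer, or whose answer
    falls outside the truncated context, are labelled with index 0 ([CLS]).
    """
    questions = df['question'].astype(str).tolist()
    contexts = df['context'].astype(str).tolist()
    answers = df['answer_text'].fillna('').astype(str).tolist()
    answer_starts = df['answer_start'].fillna(-1).astype(int).tolist()

    # the text columns are strings, so they must be tokenized rather than cast to int
    encoded = fast_tokenizer(
        questions, contexts,
        truncation='only_second', max_length=max_length, padding='max_length',
        return_offsets_mapping=True,
    )

    start_positions, end_positions = [], []
    for i, offsets in enumerate(encoded['offset_mapping']):
        start_char = answer_starts[i]
        end_char = start_char + len(answers[i])
        sequence_ids = encoded.sequence_ids(i)
        start_token, end_token = 0, 0
        if start_char >= 0 and answers[i]:
            for token_index, (token_start, token_end) in enumerate(offsets):
                if sequence_ids[token_index] != 1:  # skip question and special tokens
                    continue
                if token_start <= start_char < token_end:
                    start_token = token_index
                if token_start < end_char <= token_end:
                    end_token = token_index
            if start_token == 0 or end_token == 0:  # answer was truncated away
                start_token, end_token = 0, 0
        start_positions.append(start_token)
        end_positions.append(end_token)

    inputs = {key: encoded[key] for key in ('input_ids', 'token_type_ids', 'attention_mask')}
    dataset = tf.data.Dataset.from_tensor_slices((inputs, (start_positions, end_positions)))
    dataset = dataset.batch(batch_size)
    return dataset

# train dataset
train_dataset = get_dataset(df_train, batch_size)

# dev dataset
dev_dataset = get_dataset(df_dev, batch_size)

def span_loss(loss_fn, outputs, start_positions, end_positions):
    """Average of the cross-entropy losses on the start and end logits."""
    start_loss = loss_fn(start_positions, outputs.start_logits)
    end_loss = loss_fn(end_positions, outputs.end_logits)
    return (start_loss + end_loss) / 2

# train model
def train_model(model, train_dataset, dev_dataset, learning_rate, num_train_epochs, weight_decay):
    """
    Train the model with a custom loop and evaluate it on the dev set after each epoch.
    weight_decay is kept in the signature but unused: plain Adam applies no weight decay.
    """
    optimizer = tf.keras.optimizers.Adam(
        learning_rate = learning_rate,
    )

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    for epoch in range(num_train_epochs):
        print(f"Epoch {epoch + 1}/{num_train_epochs}")
        print('-' * 10)

        for step, (inputs, (start_positions, end_positions)) in enumerate(train_dataset):
            # record the forward pass so gradients can be computed
            with tf.GradientTape() as tape:
                outputs = model(inputs, training=True)
                loss_value = span_loss(loss_fn, outputs, start_positions, end_positions)
            gradients = tape.gradient(loss_value, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            if step % 100 == 0:
                print(f"Step {step + 1}/{len(train_dataset)}, loss: {loss_value.numpy():.4f}")

        # evaluate model
        print("\nEvaluating model on dev set...")
        print('-' * 10)
        dev_loss, dev_accuracy = evaluate_model(model, dev_dataset)
        print(f"Dev set: loss: {dev_loss}, accuracy: {dev_accuracy}")
        print("\n")

    return model

# evaluate model
def evaluate_model(model, dataset):
    """
    Return the average loss and the fraction of correctly predicted start/end positions.
    """
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    total_loss = 0.0
    num_batches = 0
    num_correct = 0
    num_examples = 0

    for inputs, (start_positions, end_positions) in dataset:
        outputs = model(inputs, training=False)
        total_loss += float(span_loss(loss_fn, outputs, start_positions, end_positions))
        num_batches += 1

        # compare predicted token indices (argmax of the logits) with the labels
        predicted_start = tf.argmax(outputs.start_logits, axis=-1, output_type=tf.int32)
        predicted_end = tf.argmax(outputs.end_logits, axis=-1, output_type=tf.int32)
        num_correct += int(tf.math.count_nonzero(tf.math.equal(predicted_start, start_positions)))
        num_correct += int(tf.math.count_nonzero(tf.math.equal(predicted_end, end_positions)))
        num_examples += int(start_positions.shape[0])

    average_loss = total_loss / num_batches
    accuracy = num_correct / (2 * num_examples)  # two predictions (start, end) per example

    return average_loss, accuracy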
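Accuracy on start and end positions treats the two predictions separately, but at inference time they have to form one valid span. Taking the argmax of each independently can produce an end that precedes the start. A common remedy, sketched here in pure Python on plain lists of logits, is to pick the pair with the highest summed score subject to `start <= end`. The helper name `best_span` and the `max_answer_len` cap are illustrative assumptions.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        # only consider ends at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits where the independent argmaxes (start=1, end=0) are not a valid span.
start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [3.0, 0.2, 1.5, 0.1]
span = best_span(start_logits, end_logits)
print(span)
```

The exhaustive double loop is quadratic in sequence length. For sequences of 384 tokens with a small `max_answer_len` this is cheap, and restricting the search to the top few start and end candidates is a common speed-up.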

In [88]:
# fit model (train_model runs the whole training loop, so no separate model.fit() call is needed)
model = train_model(model, train_dataset, dev_dataset, learning_rate, num_train_epochs, weight_decay)

Epoch 1/2
----------

