[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dhupee/Bangkit-C22CB-Company-Based-Capstone/blob/30b0995970f29114749cff04deef444de6832993/ML/distilbert_transfer_learn.ipynb)

In [1]:
# check python version
import sys
print(sys.version)

3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]


In [2]:
# notebook settings

COLAB_MODE = False # set to True if running in Google Colab
ENABLE_JSON2CSV = False # set to True to convert the JSON dataset to CSV; no longer needed

In [3]:
# if COLAB_MODE is True, then work around the repository
if COLAB_MODE:
    import os
    branch_name = 'dhupee-dev'
    cloned_repo_name = 'remote-clone'
    target_repo_dir = '/content/remote-clone/ML'
    repo_link = 'https://github.com/dhupee/Bangkit-C22CB-Company-Based-Capstone.git'
    # if current directory is not the cloned repo, clone it
    if not os.path.exists(target_repo_dir):
        !git clone --single-branch --branch $branch_name $repo_link $cloned_repo_name
        print('Repo successfully cloned!')
        %cd $target_repo_dir
        %pwd
    else:
        print('Repo already cloned')

In [4]:
# check if transformers and tensorflow are installed; if not, install them
# pinned versions: transformers 4.18.0 and tensorflow 2.8.1
try:
    import transformers
    import tensorflow as tf
    print("transformers and tensorflow are installed")
except ImportError:
    print("transformers and tensorflow are not installed")
    print("installing transformers and tensorflow")
    # install the pinned versions
    %pip install transformers==4.18.0
    %pip install tensorflow==2.8.1
    # import transformers and tensorflow again
    import transformers
    import tensorflow as tf

transformers and tensorflow are installed


In [5]:
# loading base model

model_name = "cahya/bert-base-indonesian-522M"
batch_size = 32

from transformers import AutoTokenizer, TFAutoModel # use the TensorFlow model class
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name) # plain AutoModel would load the PyTorch model instead

Some layers from the model checkpoint at cahya/bert-base-indonesian-522M were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [6]:
model.summary()

Model: "tf_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  110617344 
                                                                 
Total params: 110,617,344
Trainable params: 110,617,344
Non-trainable params: 0
_________________________________________________________________


In [7]:
# test tokenizer
tokenizer("Nama kamu siapa?")

{'input_ids': [3, 1769, 8343, 6186, 32, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [8]:
tokenizer("saya suka makan nasi goreng")

{'input_ids': [3, 3245, 5366, 2464, 6014, 11186, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [9]:
# see how base model works
unmasker = transformers.pipeline('fill-mask', model = model_name)
unmasker("mainan saya [MASK] di jalan")

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


[{'score': 0.0840362161397934,
  'token': 2186,
  'token_str': 'berada',
  'sequence': 'mainan saya berada di jalan'},
 {'score': 0.07038316130638123,
  'token': 1821,
  'token_str': 'ada',
  'sequence': 'mainan saya ada di jalan'},
 {'score': 0.0403575673699379,
  'token': 1998,
  'token_str': 'sendiri',
  'sequence': 'mainan saya sendiri di jalan'},
 {'score': 0.029048344120383263,
  'token': 2444,
  'token_str': 'lahir',
  'sequence': 'mainan saya lahir di jalan'},
 {'score': 0.028137225657701492,
  'token': 3812,
  'token_str': 'berdiri',
  'sequence': 'mainan saya berdiri di jalan'}]

In [None]:
# converts a SQuAD-format JSON file to the flat format used by Hugging Face datasets
# I suggest running it separately

def convert_huggingface(dataset_path):
    import json
    with open(dataset_path, encoding="utf8") as f:
        content = json.load(f)

    hf_data = []
    for data in content["data"]:
        title = data["title"]
        for paragraph in data["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                fill = {
                    "id": qa["id"],
                    "title": title,
                    "context": context,
                    "question": qa["question"],
                    "answers": {"answer_start": [], "text": []}
                }
                if qa["is_impossible"]:
                    answers = qa["plausible_answers"]
                else:
                    answers = qa["answers"]
                for answer in answers:
                    fill["answers"]["answer_start"].append(answer["answer_start"])
                    fill["answers"]["text"].append(answer["text"])

                hf_data.append(fill)
    # Add "_hf" before .json extension
    hf_dataset_path = dataset_path.replace(".json", "_hf.json")
    with open(hf_dataset_path, "w") as f:
        json.dump({"data": hf_data}, f)

    return hf_data
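
To see what the conversion produces without touching any files, here is a minimal in-memory sketch of the same flattening logic. The helper name `flatten_squad` and the two-question sample are hypothetical, made up for illustration; only the field names come from the function above.

```python
def flatten_squad(content):
    """In-memory variant of the conversion above (hypothetical helper, no file I/O)."""
    rows = []
    for article in content["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions keep their candidates under "plausible_answers".
                answers = qa["plausible_answers"] if qa["is_impossible"] else qa["answers"]
                rows.append({
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": {
                        "answer_start": [a["answer_start"] for a in answers],
                        "text": [a["text"] for a in answers],
                    },
                })
    return rows


# Made-up sample in SQuAD v2 layout (not taken from the real dataset).
sample = {"data": [{
    "title": "Jakarta",
    "paragraphs": [{
        "context": "Jakarta adalah ibu kota Indonesia.",
        "qas": [
            {"id": "q1", "question": "Apa ibu kota Indonesia?", "is_impossible": False,
             "answers": [{"text": "Jakarta", "answer_start": 0}]},
            {"id": "q2", "question": "Apa ibu kota Malaysia?", "is_impossible": True,
             "answers": [],
             "plausible_answers": [{"text": "Indonesia", "answer_start": 24}]},
        ],
    }],
}]}

rows = flatten_squad(sample)
for row in rows:
    print(row["id"], row["answers"])
```

Note that, as written, unanswerable questions carry their plausible answers, so later code that treats an empty `answer_start` list as "no answer" will not see them as unanswerable.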

In [10]:
# load dataset json file
import json

train_json_dir = "Translated/hf_train-v2.0_indo.json"
valid_json_dir = "Translated/hf_dev-v2.0_indo.json"
tester_json_dir  = "Translated/tester_indo.json"

dataset_dirs = [train_json_dir, valid_json_dir, tester_json_dir]
# dataset_dirs = [tester_json_dir]

In [11]:
# importing dataset

try:
    from datasets import load_dataset
except:
    print("datasets module not found")
    print("installing dataset module")
    %pip install datasets
    from datasets import load_dataset

datasets = load_dataset(
    'json',
    data_files={'train': train_json_dir, 'validation': valid_json_dir},
    field='data'
    )

Using custom data configuration default-75619838477d768f


Downloading and preparing dataset json/default to C:\Users\asus\.cache\huggingface\datasets\json\default-75619838477d768f\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to C:\Users\asus\.cache\huggingface\datasets\json\default-75619838477d768f\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
# look at one fixed sample (index 4) from the training set
datasets["train"][4]

{'id': '56bf6b0f3aeaaa14008c9602',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/ biːˈjɒnseɪ / bee-YON-say) (lahir 4 September 1981) adalah seorang penyanyi, penulis lagu, produser rekaman dan aktris Amerika. Dilahirkan dan dibesarkan di Houston, Texas, ia tampil di berbagai kompetisi menyanyi dan menari sebagai seorang anak, dan mulai terkenal pada akhir 1990-an sebagai penyanyi utama dari grup wanita R&B Destiny\'s Child. Dikelola oleh ayahnya, Mathew Knowles, grup ini menjadi salah satu grup wanita terlaris di dunia sepanjang masa. Jeda mereka melihat perilisan album debut Beyoncé, Dangerously in Love (2003), yang menjadikannya sebagai artis solo di seluruh dunia, meraih lima Grammy Awards dan menampilkan single nomor satu Billboard Hot 100 "Crazy in Love" dan "Baby Boy" .',
 'question': 'Pada dekade berapa Beyonce menjadi terkenal?',
 'answers': {'answer_start': [304], 'text': ['akhir 1990-an']}}

In [13]:
# look at one fixed sample (index 4) from the validation set
datasets["validation"][4]

{'id': '56ddde6b9a695914005b962c',
 'title': 'orang Normandia',
 'context': 'Bangsa Norman (Norman: Nourmands; Prancis: Normandia; Latin: Normanni) adalah orang-orang yang pada abad ke-10 dan ke-11 memberi nama kepada Normandia, sebuah wilayah di Prancis. Mereka adalah keturunan dari perampok dan bajak laut Norse ("Norman" berasal dari "Norseman") dan bajak laut dari Denmark, Islandia dan Norwegia yang, di bawah pemimpin mereka Rollo, setuju untuk bersumpah setia kepada Raja Charles III dari Francia Barat. Melalui generasi asimilasi dan pencampuran dengan penduduk asli Franka dan Romawi-Gaul, keturunan mereka secara bertahap akan bergabung dengan budaya berbasis Carolingian di Francia Barat. Identitas budaya dan etnis yang berbeda dari Normandia awalnya muncul pada paruh pertama abad ke-10, dan terus berkembang selama abad-abad berikutnya.',
 'question': 'Abad berapa orang Normandia pertama kali mendapatkan identitas mereka yang terpisah?',
 'answers': {'answer_start': [671, 649, 671, 

In [132]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    '''display random elements from dataset

    Args:
        dataset (Dataset): dataset to show
        num_examples (int, optional): number of examples to show. Defaults to 10.
    '''
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
    display(HTML(df.to_html()))

In [133]:
show_random_elements(datasets["train"]) # brace yourself, this is gonna be a long list

Unnamed: 0,id,title,context,question,answers
0,572a054f6aef0514001551b7,East_Prussia,"Prusia Timur menutupi sebagian besar tanah leluhur Prusia Kuno Baltik. Selama abad ke-13, penduduk asli Prusia ditaklukkan oleh Tentara Salib para Ksatria Teutonik. Suku Balt asli yang selamat dari penaklukan secara bertahap menjadi Kristen. Karena Jermanisasi dan penjajahan selama abad-abad berikutnya, Jerman menjadi kelompok etnis yang dominan, sedangkan Polandia dan Lituania menjadi minoritas. Dari abad ke-13, Prusia Timur adalah bagian dari negara biara Ksatria Teutonik. Setelah Kedamaian Duri Kedua pada tahun 1466 menjadi wilayah kekuasaan Kerajaan Polandia. Pada tahun 1525, dengan Penghormatan Prusia, provinsi tersebut menjadi Kadipaten Prusia. Bahasa Prusia Kuno telah punah pada abad ke-17 atau awal abad ke-18.",Apa kelompok lain selama periode ini untuk minoritas bentuk?,"{'answer_start': [359], 'text': ['Polandia dan Lituania']}"
1,5ad33f45604f3c001a3fdbc5,PlayStation_3,"Pada E3 2007, Sony mampu menampilkan sejumlah video game mendatang mereka untuk PlayStation 3, termasuk Heavenly Sword, Lair, Ratchet & Clank Future: Tools of Destruction, Warhawk dan Uncharted: Drake's Fortune; semuanya dirilis pada kuartal ketiga dan keempat tahun 2007. Mereka juga memamerkan sejumlah judul yang akan dirilis pada tahun 2008 dan 2009; terutama Killzone 2, Infamous, Gran Turismo 5 Prologue, LittleBigPlanet dan SOCOM: U.S. Navy SEALs Confrontation. Sejumlah eksklusif pihak ketiga juga ditampilkan, termasuk Metal Gear Solid 4: Guns of the Patriots yang sangat dinanti, bersama judul pihak ketiga profil tinggi lainnya seperti Grand Theft Auto IV, Call of Duty 4: Modern Warfare, Assassin's Creed, Devil May Cry 4 dan Resident Evil 5. Dua judul penting lainnya untuk PlayStation 3, Final Fantasy XIII dan Final Fantasy Versus XIII, ditampilkan di TGS 2007 untuk menenangkan pasar Jepang.",Game pihak ketiga yang paling dinantikan dengan nama bulan dalam tahun di dalamnya yang ditampilkan Sony di E4 2007?,"{'answer_start': [718], 'text': ['Devil May Cry 4']}"
2,56e7af6400c9c71400d774e0,Nanjing,"Saat ini, dengan tradisi budaya yang panjang dan dukungan kuat dari lembaga pendidikan setempat, Nanjing umumnya dipandang sebagai ""kota budaya"" dan salah satu kota yang lebih menyenangkan untuk ditinggali di Tiongkok.",Siapa yang memberikan dukungan kuat kepada Nanjing?,"{'answer_start': [68], 'text': ['lembaga pendidikan setempat,']}"
3,56dfc1797aa994140058e12e,Lighting,"The Professional Lighting And Sound Association (PLASA) adalah organisasi perdagangan berbasis di Inggris yang mewakili 500+ anggota individu dan perusahaan yang diambil dari sektor layanan teknis. Anggotanya termasuk produsen dan distributor pencahayaan panggung dan hiburan, suara, tali-temali dan produk dan layanan serupa, serta profesional terafiliasi di area tersebut. Mereka melobi dan mewakili kepentingan industri di berbagai tingkatan, berinteraksi dengan pemerintah dan badan pengatur dan mempresentasikan kasus untuk industri hiburan. Contoh subjek dari representasi ini termasuk peninjauan frekuensi radio yang sedang berlangsung (yang mungkin atau mungkin tidak memengaruhi pita radio yang digunakan mikrofon nirkabel dan perangkat lain) dan terlibat dengan masalah seputar pengenalan peraturan RoHS (Restriction of Hazardous Substances Directive) .",Apa singkatan dari RoHS?,"{'answer_start': [814], 'text': ['(Restriction of Hazardous']}"
4,5acd153607355d001abf33ce,Energy,"Energi total suatu sistem dapat dibagi lagi dan diklasifikasikan dalam berbagai cara. Misalnya, mekanika klasik membedakan antara energi kinetik yang ditentukan oleh pergerakan benda melalui ruang, dan energi potensial, yang merupakan fungsi dari posisi suatu benda dalam suatu medan. Mungkin juga mudah untuk membedakan energi gravitasi, energi panas, beberapa jenis energi nuklir (yang memanfaatkan potensial dari gaya nuklir dan gaya lemah), energi listrik (dari medan listrik), dan energi magnet (dari medan magnet) , diantara yang lain. Banyak dari klasifikasi ini tumpang tindih; misalnya, energi panas biasanya terdiri dari energi kinetik dan sebagian lagi energi potensial.",klasifikasi apa yang tidak tumpang tindih?,"{'answer_start': [596], 'text': ['energi panas biasanya terdiri dari energi kinetik dan sebagian lagi energi potensial.']}"
5,572840b1ff5b5019007da004,Federalism,"Federasi sering menggunakan paradoks sebagai penyatuan negara, sementara tetap menjadi negara (atau memiliki aspek kenegaraan) dalam dirinya sendiri. Misalnya, James Madison (penulis Konstitusi AS) menulis dalam Federalist Paper No. 39 bahwa Konstitusi AS ""dalam ketegasannya bukanlah konstitusi nasional maupun federal; tetapi komposisi keduanya. Pada dasarnya, itu federal, bukan nasional; dalam sumber dari mana kekuasaan biasa Pemerintah diambil, itu sebagian federal, dan sebagian nasional ... ""Ini berasal dari fakta bahwa negara bagian di AS mempertahankan semua kedaulatan sehingga mereka tidak menyerah kepada federasi dengan persetujuan mereka sendiri. Hal ini ditegaskan kembali oleh Amandemen Kesepuluh atas Konstitusi Amerika Serikat, yang memiliki semua kekuasaan dan hak yang tidak dilimpahkan kepada Pemerintah Federal sebagaimana diserahkan kepada Amerika Serikat dan rakyat.",Apa dasar dari kertas federalis no. 39?,"{'answer_start': [363], 'text': ['itu federal, bukan nasional; dalam sumber dari mana kekuasaan biasa Pemerintah diambil, itu sebagian federal, dan sebagian nasional ...']}"
6,5a863e3ab4e223001a8e7503,Russian_language,"Dialek Rusia Utara dan yang diucapkan di sepanjang Sungai Volga biasanya diucapkan tanpa tekanan / o / dengan jelas (fenomena ini disebut okanye / оканье). Selain tidak adanya reduksi vokal, beberapa dialek memiliki tinggi atau diftong / e ~ i̯ɛ / menggantikan Proto-Slavia * ě dan / o ~ u̯ɔ / dalam suku kata tertutup yang ditekankan (seperti dalam bahasa Ukraina), bukan Bahasa Rusia Standar / e / dan /Hai/. Ciri morfologi yang menarik adalah artikel pasti pasca-posed -to, -ta, -te mirip dengan yang ada di Bulgaria dan Makedonia.",Apa dialek yang digunakan di Makedonia?,"{'answer_start': [7], 'text': ['Rusia Utara']}"
7,5acd3fbb07355d001abf3a43,Estonian_language,"Meskipun ortografi Estonia umumnya dipandu oleh prinsip fonemik, dengan setiap grafem berhubungan dengan satu fonem, terdapat beberapa penyimpangan historis dan morfologis dari ini: misalnya pelestarian morfem dalam deklarasi kata (menulis b, g, d di tempat-tempat di mana p, k, t diucapkan) dan dalam penggunaan 'i' dan 'j'. [klarifikasi diperlukan] Jika sangat tidak praktis atau tidak mungkin untuk mengetikkan š dan ž, mereka diganti dengan sh dan zh dalam beberapa teks tertulis, meskipun ini dianggap tidak benar. Jika tidak, h in sh mewakili frikatif glotal yang tidak bersuara, seperti dalam Pasha (pas-ha); ini juga berlaku untuk beberapa nama asing.",Pada kesempatan apa š dan ž diganti dengan ch dan zu?,"{'answer_start': [356], 'text': ['sangat tidak praktis atau tidak mungkin untuk mengetikkan š dan ž,']}"
8,56de8c374396321400ee2a14,Arnold_Schwarzenegger,"Twins (1988), sebuah komedi dengan Danny DeVito, juga terbukti sukses. Total Recall (1990) menjaring Schwarzenegger $ 10 juta dan 15% dari pendapatan kotor film. Sebuah naskah fiksi ilmiah, film ini didasarkan pada cerita pendek Philip K. Dick ""We Can Remember It for You Wholesale"". Kindergarten Cop (1990) mempertemukannya kembali dengan sutradara Ivan Reitman, yang mengarahkannya di Twins. Schwarzenegger memiliki pengalaman singkat dalam penyutradaraan, pertama dengan episode tahun 1990 dari serial TV Tales from the Crypt, berjudul ""The Switch"", dan kemudian dengan film televisi tahun 1992 Natal di Connecticut. Dia tidak menyutradarai sejak itu.",Sebuah episode dari serial TV terkenal apa yang merupakan debut sutradara Schwarzenegger?,"{'answer_start': [519], 'text': ['the Crypt, berjudul']}"
9,5a299c8103c0e7001a3e1858,United_Nations_Population_Fund,"Presiden Bush menolak dana untuk UNFPA. Selama masa Pemerintahan Bush, sejumlah $ 244 juta dalam pendanaan yang disetujui Kongres diblokir oleh Cabang Eksekutif.",Pejabat pemerintah mana yang meningkatkan pendanaan untuk UNFPA?,"{'answer_start': [0], 'text': ['Presiden']}"


In [134]:
'''
    Converts the SQuAD JSON files to CSV via a pandas DataFrame, one file at a time.

    I don't want to run this locally; better to use Colab.

    No longer needed: use load_dataset instead.
'''

if ENABLE_JSON2CSV:
    import utils
    for json_path in dataset_dirs:
        with open(json_path, encoding="utf-8") as json_file:
            data = json.load(json_file)['data']

        df = utils.json_to_df(data)
        df.to_csv(json_path.replace(".json", ".csv"), index=False)

In [135]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast) # offset mappings require a fast tokenizer

---

In [136]:
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The allowed overlap between two parts of the context when splitting is performed.

In [137]:
# find the first training example whose tokenized
# question + context is longer than max_length
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

In [138]:
# token count of that example without truncation
len(tokenizer(example["question"], example["context"])["input_ids"])

402

In [139]:
# token count after truncating the context to max_length
len(
    tokenizer(
        example["question"],
        example["context"],
        max_length=max_length,
        truncation="only_second",
    )["input_ids"]
)

384

In [140]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
)

In [141]:
[len(x) for x in tokenized_example["input_ids"]]

[384, 156]
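
The two lengths can be reproduced by hand. The sketch below is a simplified model of the windowing, not the tokenizer's actual implementation. It assumes a BERT-style layout (`[CLS] question [SEP] context [SEP]`, so three special tokens) and that each new window starts `doc_stride` tokens before the previous one ended. The helper name `context_windows` is hypothetical; the 7 question tokens and 392 context tokens are derived from the 402-token example (402 minus 7 question tokens minus 3 special tokens).

```python
def context_windows(num_context_tokens, num_question_tokens,
                    max_length=384, doc_stride=128, num_special_tokens=3):
    """Return (start, end) context-token ranges for each overlapping window."""
    # Room left for context tokens once the question and special tokens are placed.
    capacity = max_length - num_question_tokens - num_special_tokens
    if capacity <= doc_stride:
        raise ValueError("doc_stride must be smaller than the context capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # The next window re-reads the last doc_stride tokens of this one.
        start = end - doc_stride
    return windows


windows = context_windows(num_context_tokens=392, num_question_tokens=7)
# Feature length = context tokens in the window + question tokens + special tokens.
lengths = [end - start + 7 + 3 for start, end in windows]
print(windows)
print(lengths)
```

Under these assumptions the second window covers the last 146 context tokens, 128 of which overlap with the first window.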

In [142]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] beyonce menikah pada 2008 dengan siapa? [SEP] pada 4 april 2008, beyonce menikahi jay z. dia secara terbuka mengungkapkan pernikahan mereka dalam montase video di pesta mendengarkan untuk album studio ketiganya, i am... sasha fierce, di sony club manhattan pada 22 oktober 2008. i am... sasha fierce dirilis pada 18 november 2008 di amerika serikat. album ini secara resmi memperkenalkan alter ego beyonce sasha fierce, yang dibuat selama pembuatan singel tahun 2003 " crazy in love ", terjual 482. 000 kopi di minggu pertama, memulai debutnya di atas billboard 200, dan memberikan beyonce album nomor satu ketiganya berturut - turut di kami. album ini menampilkan lagu nomor satu " single ladies ( put a ring on it ) " dan lagu lima teratas " if i were a boy " dan " halo ". mencapai pencapaian menjadi single hot 100 terlama dalam karirnya, kesuksesan " halo " di as membantu beyonce mencapai lebih dari sepuluh single teratas dalam daftar daripada wanita lain selama tahun 2000 - an. ini jug

In [143]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 7), (8, 15), (16, 20), (21, 25), (26, 32), (33, 38), (38, 39), (0, 0), (0, 4), (5, 6), (7, 12), (13, 17), (17, 18), (19, 26), (27, 35), (36, 39), (40, 41), (41, 42), (43, 46), (47, 53), (54, 61), (62, 75), (76, 86), (87, 93), (94, 99), (100, 104), (104, 107), (108, 113), (114, 116), (117, 122), (123, 135), (136, 141), (142, 147), (148, 154), (155, 164), (164, 165), (166, 167), (168, 170), (171, 172), (172, 173), (173, 174), (175, 178), (178, 180), (181, 183), (183, 186), (186, 187), (187, 188), (189, 191), (192, 196), (197, 201), (202, 211), (212, 216), (217, 219), (220, 227), (228, 232), (232, 233), (234, 235), (236, 238), (239, 240), (240, 241), (241, 242), (243, 246), (246, 248), (249, 251), (251, 254), (254, 255), (256, 263), (264, 268), (269, 271), (272, 280), (281, 285), (286, 288), (289, 296), (297, 304), (304, 305), (306, 311), (312, 315), (316, 322), (323, 328), (329, 343), (344, 349), (350, 353), (354, 361), (362, 365), (365, 367), (368, 370), (370, 373), (373, 3

In [144]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [145]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at cahya/bert-base-indonesian-522M and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [146]:
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (
    offsets[token_start_index][0] <= start_char
    and offsets[token_end_index][1] >= end_char
):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while (
        token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char
    ):
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

16 17


In [147]:
print(
    tokenizer.decode(
        tokenized_example["input_ids"][0][start_position : end_position + 1]
    )
)
print(answers["text"][0])

jay z
Jay Z
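
The same character-to-token search can be tried without a model. This is a minimal sketch that swaps the real tokenizer for a toy regex tokenizer (an assumption, used only to produce offsets); the helper name `char_span_to_token_span` and the short context are hypothetical. It also adds a lower-bound guard on the end index, which the cell above does not need because the question tokens always precede the context.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (start, end) token indices, or None if not covered."""
    first, last = 0, len(offsets) - 1
    # The answer must lie entirely inside the text covered by the tokens.
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return None
    # Walk forward past every token that starts at or before the answer start.
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    # Walk backward past every token that ends at or after the answer end.
    while last >= 0 and offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1


context = "Beyonce menikahi Jay Z pada 2008"
# Toy tokenizer (assumption): one token per word or punctuation mark.
matches = list(re.finditer(r"\w+|[^\w\s]", context))
tokens = [m.group() for m in matches]
offsets = [m.span() for m in matches]

answer = "Jay Z"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span, tokens[span[0]:span[1] + 1])
```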


In [148]:
pad_on_right = tokenizer.padding_side == "right"

In [149]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [150]:
features = prepare_train_features(datasets["train"][:5])

In [151]:
tokenized_datasets = datasets.map(
    prepare_train_features, batched=True, remove_columns=datasets["train"].column_names
)

100%|██████████| 115/115 [01:40<00:00,  1.14ba/s]
100%|██████████| 12/12 [00:10<00:00,  1.18ba/s]


In [152]:
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained(model_name)

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at cahya/bert-base-indonesian-522M and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [153]:
model_name = model_name.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-squad"
learning_rate = 2e-5
num_train_epochs = 2
weight_decay = 0.01

In [154]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

In [155]:
train_set = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
validation_set = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [156]:
from transformers import create_optimizer

total_train_steps = (len(tokenized_datasets["train"]) // batch_size) * num_train_epochs

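# Pass the weight_decay hyperparameter defined above; without it the value is never used.
optimizer, schedule = create_optimizer(
    init_lr=learning_rate,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
    weight_decay_rate=weight_decay,
)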
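
The step count above feeds the learning-rate schedule. As a rough illustration only, the sketch below assumes that, with no warmup, the schedule decays linearly from the initial rate to zero over `total_train_steps`; that assumption should be checked by inspecting the returned `schedule` object. The helper `linear_decay_lr` and the `toy_*` values are hypothetical and not part of the notebook.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates, assuming linear decay to zero and no warmup."""
    step = min(step, total_steps)  # clamp so the rate never goes negative
    return init_lr * (1 - step / total_steps)


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
toy_num_examples = 1000
toy_batch_size = 32
toy_epochs = 2

# Same formula as the cell above: full batches per epoch times number of epochs.
toy_total_steps = (toy_num_examples // toy_batch_size) * toy_epochs

for step in (0, toy_total_steps // 2, toy_total_steps):
    print(step, linear_decay_lr(step, 2e-5, toy_total_steps))
```

Note that the floor division ignores a final partial batch, so the schedule reaches zero slightly early if that batch is actually trained on.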

In [157]:
import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [158]:
from tensorflow.keras.callbacks import TensorBoard

# Log training metrics for TensorBoard. Pushing to the Hub is not enabled here,
# so PushToHubCallback is not imported.
tensorboard_callback = TensorBoard(log_dir="./qa_model_save/logs")

callbacks = [tensorboard_callback]

model.fit(
    train_set,
    validation_data=validation_set,
    epochs=num_train_epochs,
    callbacks=callbacks,
)

Epoch 1/2


In [None]:
batch = next(iter(validation_set))
output = model.predict_on_batch(batch)
output.keys()

In [None]:
output.start_logits.shape, output.end_logits.shape

In [None]:
import numpy as np

np.argmax(output.start_logits, -1), np.argmax(output.end_logits, -1)

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0]
end_logits = output.end_logits[0]
# Gather the indices of the n_best_size highest start/end logits, best first:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if (
            start_index <= end_index
        ):  # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "",  # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )
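
The slice `[-1 : -n_best_size - 1 : -1]` in the cell above can be hard to read at first. The sketch below reproduces the same idea on plain Python lists: sort the positions in ascending order of score, then walk backwards from the end to take the top `n`. The helper `top_n_indices` and the `toy_*` logits are hypothetical and only illustrate the indexing; they are not taken from the model output.

```python
def top_n_indices(scores, n):
    """Return the indices of the n largest scores, best first."""
    # Ascending "argsort": positions ordered from the smallest to the largest score.
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Same slice as in the notebook: start at the last element and step backwards n times.
    return ascending[-1 : -n - 1 : -1]


# Hypothetical logits for a feature with 6 tokens.
toy_start_logits = [0.1, 2.5, -1.0, 0.7, 3.2, 0.0]
toy_end_logits = [0.3, 0.2, 1.9, -0.5, 0.4, 2.8]

start_candidates = top_n_indices(toy_start_logits, 3)
end_candidates = top_n_indices(toy_end_logits, 3)

# Keep only spans that do not end before they start, scored by the sum of both logits.
pairs = [
    (s, e, toy_start_logits[s] + toy_end_logits[e])
    for s in start_candidates
    for e in end_candidates
    if s <= e
]
best = max(pairs, key=lambda p: p[2])
print(start_candidates, end_candidates, best[:2])
```

Restricting the search to the top `n` starts and ends keeps the number of candidate pairs at most `n * n` instead of growing with the square of the sequence length.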

In [None]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may produce several features when its context is long, and the context of each feature
    # slightly overlaps the context of the previous one.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set the offset_mapping entries that are not part of the context to None, so it is easy to
        # determine whether a token position belongs to the context.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
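
To make the masking step in `prepare_validation_features` concrete, here is a minimal sketch on hand-written data. It assumes the usual convention where `sequence_ids` yields `None` for special tokens, `0` for the first sequence and `1` for the second; the helper `mask_non_context_offsets` and the `toy_*` values are hypothetical and do not come from the real tokenizer.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_index=1):
    """Replace the offsets of every token outside the context with None."""
    return [
        offset if seq_id == context_index else None
        for offset, seq_id in zip(offsets, sequence_ids)
    ]


# Hypothetical feature layout: [CLS] question question [SEP] context context context [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 7), (8, 14), (15, 24), (0, 0)]

masked = mask_non_context_offsets(toy_offsets, toy_sequence_ids)

# After masking, a simple `is not None` test tells us whether a token is in the context.
context_positions = [i for i, offset in enumerate(masked) if offset is not None]
print(context_positions)
```

With the offsets masked this way, post-processing can reject any predicted start or end position that points at the question or at a special token.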

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names,
)

In [None]:
validation_dataset = validation_features.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [None]:
raw_predictions = model.predict(validation_dataset)

In [None]:
max_answer_length = 30

In [None]:
start_logits = output.start_logits[0]
end_logits = output.end_logits[0]
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. In the general case, we will need to match the example_id to
# an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the n_best_size highest start/end logits, best first:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # The span passed both checks above: map the token indices back to character
        # offsets and slice the answer text out of the original context.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char:end_char],
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[
    :n_best_size
]
valid_answers
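
The filtering and text recovery in the cell above can be isolated into one small function. The sketch below applies the same three checks (index in range, token inside the context, span length within `max_answer_length`) and then slices the context using character offsets. The function `span_to_text`, the Indonesian sentence and its offsets are hypothetical, hand-written values rather than tokenizer output.

```python
def span_to_text(context, offsets, start_index, end_index, max_answer_length=30):
    """Return the context substring for a token span, or None if the span is invalid."""
    # Indices beyond the feature length cannot be answers.
    if start_index >= len(offsets) or end_index >= len(offsets):
        return None
    # A None offset marks a token that is not part of the context.
    if offsets[start_index] is None or offsets[end_index] is None:
        return None
    # Reject spans that end before they start or that are too long.
    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
        return None
    start_char = offsets[start_index][0]
    end_char = offsets[end_index][1]
    return context[start_char:end_char]


toy_context = "Jakarta adalah ibu kota Indonesia"
# Three masked (non-context) positions, five context tokens, one trailing special token.
toy_offsets = [None, None, None, (0, 7), (8, 14), (15, 18), (19, 23), (24, 33), None]

print(span_to_text(toy_context, toy_offsets, 5, 6))  # token span covering "ibu kota"
print(span_to_text(toy_context, toy_offsets, 0, 4))  # start is outside the context
```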

In [None]:
datasets["validation"][0]["answers"]

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
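
The grouping above is easier to follow on a tiny example. The sketch below uses hypothetical ids (`toy_example_ids`, `toy_feature_example_ids`) in which the first example was split into two features and the third into three, and builds the same kind of mapping from example index to feature indices.

```python
import collections

# Hypothetical ids: three examples, six features produced by overflow splitting.
toy_example_ids = ["q1", "q2", "q3"]
toy_feature_example_ids = ["q1", "q1", "q2", "q3", "q3", "q3"]

# Map each example id to its position in the examples list.
toy_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}

# Collect, for every example index, the indices of the features derived from it.
toy_features_per_example = collections.defaultdict(list)
for feature_index, example_id in enumerate(toy_feature_example_ids):
    toy_features_per_example[toy_id_to_index[example_id]].append(feature_index)

print(dict(toy_features_per_example))
```

Post-processing can then loop over examples and consider the candidate answers from all of an example's features together.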