<a href="https://colab.research.google.com/github/Crystal-Reshea/FinBert-Albert-nlp/blob/main/Fine_Tuning_Albert_for_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
Actions Completed
* Trained the ALBERT base model for question answering
     * Basis for code: https://towardsdatascience.com/how-to-fine-tune-a-q-a-transformer-86f91ec92997
     * Trained for 3 epochs with a batch size of 8
     * Ideally I would train for 4 epochs with a batch size of 16 and a learning rate of 3e-5. I held back the first time out of caution about RAM, but there should be enough.
* Prepped the 10-K data
  * The 10-K file is large and contains many lines that we do not need. I extracted the text in groups of lines and then removed the newlines.
  * Most BERT models, including ALBERT, have a maximum sequence length of 512 tokens, so longer text needs a handling strategy. Currently the tokenizer truncates the text when it hits the maximum. Another idea is described below.
* Used the model on Item 7
  * The model does okay. It got one question wrong, possibly because the text was truncated. (We could split the text into parts, ask the question against each part, and accept the answer with the highest score.)

Things I'd like to do: 
* Validate model and evaluate performance
* Organize Item 7 by sections (headings)
* Retrain for better results
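
For the validation item in the list above, here is a minimal sketch of SQuAD-style exact match and token-level F1, assuming predictions and gold answers are plain strings. The function names are hypothetical, and the normalization (lowercasing, dropping punctuation and English articles) only follows the general convention of the SQuAD evaluation; it is not the official script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match (an unanswerable question answered with "").
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match('The Cook Out.', 'cook out'))
print(f1_score('in late 1990s', 'late 1990s'))
```

Averaging both metrics over the validation questions would give a first performance estimate to compare against after retraining.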

In [None]:
pip install transformers

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import json

def read_squad(path):
    # open JSON file and load into dictionary
    with open(path, 'rb') as file:
        squad2_dict = json.load(file)
        
    contexts = []
    questions = []
    answers = []
    # iterate through all data in squad data
    for group in squad2_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                # check if we need to be extracting from 'answers' or 'plausible_answers'
                if 'plausible_answers' in qa.keys():
                    access = 'plausible_answers'
                else:
                    access = 'answers'
                for answer in qa[access]:
                    # append data to lists
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)
    # return formatted data lists
    return contexts, questions, answers

# execute our read SQuAD function for training and validation sets
train_contexts, train_questions, train_answers = read_squad('/content/drive/MyDrive/NLP_POC/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('/content/drive/MyDrive/NLP_POC/dev-v2.0.json')

In [None]:
train_questions[:5]

['When did Beyonce start becoming popular?',
 'What areas did Beyonce compete in when she was growing up?',
 "When did Beyonce leave Destiny's Child and become a solo singer?",
 'In what city and state did Beyonce  grow up? ',
 'In which decade did Beyonce become famous?']

In [None]:
def add_end_idx(answers, contexts):
    # loop through each answer-context pair
    for answer, context in zip(answers, contexts):
        # gold_text refers to the answer we are expecting to find in context
        gold_text = answer['text']
        # we already know the start index
        start_idx = answer['answer_start']
        # and ideally this would be the end index...
        end_idx = start_idx + len(gold_text)

        # ...however, sometimes squad answers are off by a character or two
        if context[start_idx:end_idx] == gold_text:
            # if the answer is not off :)
            answer['answer_end'] = end_idx
        else:
            # this means the answer is off by 1-2 characters
            for n in [1, 2]:
                if context[start_idx-n:end_idx-n] == gold_text:
                    answer['answer_start'] = start_idx - n
                    answer['answer_end'] = end_idx - n
            
# and apply the function to our two answer lists
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
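
If neither shift matches, `add_end_idx` leaves the answer without an `answer_end` key, which would only surface later as a `KeyError`. A small sanity check along these lines could catch that early; `find_misaligned` is a hypothetical helper and the two answers are toy data, not taken from SQuAD.

```python
def find_misaligned(answers, contexts):
    """Return indices whose character span does not reproduce the answer text."""
    bad = []
    for i, (answer, context) in enumerate(zip(answers, contexts)):
        end = answer.get('answer_end')
        if end is None or context[answer['answer_start']:end] != answer['text']:
            bad.append(i)
    return bad

contexts = ['Beyonce grew up in Houston, Texas.'] * 2
answers = [
    # Correctly aligned span.
    {'text': 'Houston, Texas', 'answer_start': 19, 'answer_end': 33},
    # Start index too far off to be repaired, so no 'answer_end' was ever set.
    {'text': 'Houston, Texas', 'answer_start': 25},
]
print(find_misaligned(answers, contexts))
```

Running the same check on `train_answers` and `val_answers` would show how many examples, if any, need to be dropped or repaired before tokenization.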

In [None]:
from transformers import AlbertTokenizerFast
# initialize the tokenizer
tokenizer = AlbertTokenizerFast.from_pretrained('albert-base-v2')

Downloading:   0%|          | 0.00/742k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/684 [00:00<?, ?B/s]

In [None]:
# tokenize
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [None]:
tokenizer.decode(train_encodings['input_ids'][250])

'[CLS] her fourth studio album 4 was released on june 28, 2011 in the us. 4 sold 310,000 copies in its first week and debuted atop the billboard 200 chart, giving beyonce her fourth consecutive number-one album in the us. the album was preceded by two of its singles "run the world (girls)" and "best thing i never had", which both attained moderate success. the fourth single "love on top" was a commercial success in the us. 4 also produced four other singles; "party", "countdown", "i care" and "end of time". "eat, play, love", a cover story written by beyonce for essence that detailed her 2010 career break, won her a writing award from the new york association of black journalists. in late 2011, she took the stage at new york\'s roseland ballroom for four nights of special performances: the 4 intimate nights with beyonce concerts saw the performance of her 4 album to a standing room only.[SEP] where did beyonce perform for four nights of standing room only concerts in 2011?[SEP]<pad><pa

In [None]:
def add_token_positions(encodings, answers):
    # initialize lists to contain the token indices of answer start/end
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # append start/end token position using char_to_token method
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        # end position cannot be found, char_to_token found space, so shift position until found
        shift = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - shift)
            shift += 1
    # update our encodings object with the new token-based start/end positions
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

# apply function to our data
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
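
To illustrate why the function above shifts the end position back until a token is found, here is a toy stand-in for the character-to-token lookup. It uses whitespace tokens with hand-written offsets rather than the real SentencePiece tokenizer, and `char_to_token_index` is a hypothetical name, not the library method.

```python
def char_to_token_index(offsets, char_idx):
    """Return the index of the token whose [start, end) span covers char_idx, else None."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None

text = 'born in Houston'
# Toy whitespace tokenization with character offsets (not SentencePiece).
offsets = [(0, 4), (5, 7), (8, 15)]
start_char, end_char = 8, 15  # span of 'Houston'; the end index is exclusive

print(char_to_token_index(offsets, start_char))    # token covering the first character
print(char_to_token_index(offsets, end_char))      # one past the answer: no token here
print(char_to_token_index(offsets, end_char - 1))  # shifting back by one finds the token
```

Because `answer_end` is an exclusive index, it can point at a space or past the end of the text, which is the case the `while` loop handles.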

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [None]:
from transformers import AlbertTokenizer, AlbertForQuestionAnswering
model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')

Downloading:   0%|          | 0.00/45.2M [00:00<?, ?B/s]

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForQuestionAnswering: ['predictions.decoder.bias', 'predictions.dense.weight', 'predictions.LayerNorm.bias', 'predictions.LayerNorm.weight', 'predictions.bias', 'predictions.decoder.weight', 'predictions.dense.bias']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN t

In [None]:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# build datasets for both our training and validation sets
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW in transformers is deprecated in favor of the PyTorch one
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize adam optimizer with weight decay (reduces chance of overfitting)
optim = AdamW(model.parameters(), lr=5e-5)

# initialize data loader for training data
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for epoch in range(3):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # extract loss
        loss = outputs[0]
        # backpropagate to compute gradients for every parameter that needs an update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 16290/16290 [2:06:57<00:00,  2.14it/s, loss=0.796]
Epoch 1: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=1.16]
Epoch 2: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=0.769]


In [None]:
model_path = '/content/drive/MyDrive/NLP_POC/models'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/NLP_POC/models/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_POC/models/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_POC/models/tokenizer.json')

Started at 3:21 pm <br>
Check-in #1 at 3:45 pm <br>
Check-in #2 at 4:02 pm | epoch 0, 33% <br>
Check-in #3 at 4:27 pm | epoch 0, 52% <br>
Check-in #4 at 4:58 pm | epoch 0, 77% <br> 
Check-in #5 at 5:08 pm | epoch 0, 84% <br>
Check-in #6 at 5:40 pm | epoch 1, 9%  <br> 
Check-in #7 at 6:03 pm | epoch 1, 28% <br> 
Check-in #8 at 6:23 pm | epoch 1, 43% <br> 
Check-in #9 at 6:38 pm | epoch 1, 56% <br> 
Check-in #10 at 7:00 pm | epoch 1, 73% <br>
Check-in #11 at 7:13 pm | epoch 1, 83% <br>
Check-in #12 at 7:50 pm | epoch 2, 12% <br> 
Check-in #13 at 8:36 pm | epoch 2, 48% <br> 
Check-in #14 at 8:55 pm | epoch 2, 63% <br>
Check-in #15 at 9:21 pm | epoch 2, 83% <br>
Check-in #16 at 9:43 pm | epoch 2, 100%

In [None]:
model = AlbertForQuestionAnswering.from_pretrained(model_path)
tokenizer = AlbertTokenizerFast.from_pretrained(model_path)

# Training Code to try Later for Better Performance

In [None]:
# If time allows, try training with these parameters
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW in transformers is deprecated in favor of the PyTorch one
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize adam optimizer with weight decay (reduces chance of overfitting)
optim = AdamW(model.parameters(), lr=3e-5)

# initialize data loader for training data
# batch size of 16, as proposed in the overview (the first run used 8)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(4):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # extract loss
        loss = outputs[0]
        # backpropagate to compute gradients for every parameter that needs an update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

# Using the New Model


In [None]:
pip install transformers



In [None]:
import torch

In [None]:
import transformers
from transformers import AlbertForQuestionAnswering
from transformers import AlbertTokenizerFast

In [None]:
model = AlbertForQuestionAnswering.from_pretrained('/content/drive/MyDrive/NLP_POC/models')

In [None]:
tokenizer = AlbertTokenizerFast.from_pretrained('/content/drive/MyDrive/NLP_POC/models')

## Function to Process Answers through Model

In [None]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns them as a single string.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text, truncation=True,)

    # Report how long the input sequence is.
    print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example through the model.
    outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                    return_dict=True) 

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    return answer
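
Taking the two `argmax` values independently can produce an end position before the start position, in which case the loop above does nothing and only the start token is returned. One possible alternative, sketched here on plain Python lists, is to pick the best pair with `start <= end` and a bounded length. `best_span` and `max_answer_len` are hypothetical names, and the scores are made-up numbers.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = (0, 0, float('-inf'))
    for i, start_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_scores), i + max_answer_len)
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Independent argmax would choose start=2 and end=0 here, which is not a valid span.
start = [0.1, 0.2, 5.0, 0.3]
end = [4.0, 0.1, 1.0, 3.0]
print(best_span(start, end))
```

Inside `answer_question`, the inputs could be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. The returned score is also one candidate for comparing answers across several pieces of a long text.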

## Simple Sample Questions

In [None]:
answer_question("What is the best restaurant in the United States?", "The most beloved restaurant in the united states is called Cook Out. Cook out is a national treasure everyone loves it.")


Query has 36 tokens.

Answer: "▁cook ▁out ."


In [None]:
question = "What country did Tik Tok start in?"
context = "TikTok, known in China as Douyin (Chinese: 抖音; pinyin: Dǒuyīn), is a video-focused social networking service owned by Chinese company ByteDance.[4] It hosts a variety of short-form user videos, from genres like pranks, stunts, tricks, jokes, dance, and entertainment[5][6] with durations from 15 seconds to three minutes.[7][8][9] TikTok is an international version of Douyin, which was originally released in the Chinese market in September 2016.[10] TikTok was launched in 2017 for iOS and Android in most markets outside of mainland China; however, it became available worldwide only after merging with another Chinese social media service, Musical.ly, on 2 August 2018. TikTok and Douyin have almost the same user interface but no access to each other's content. Their servers are each based in the market where the respective app is available.[11] The two products are similar, but features are not identical. Douyin includes an in-video search feature that can search by people's faces for more videos of them and other features such as buying, booking hotels and making geo-tagged reviews.[12] Since its launch in 2016, TikTok/Douyin rapidly gained popularity in East Asia, South Asia, Southeast Asia, the United States, Turkey, Russia, and other parts of the world.[13][14] As of October 2020, TikTok surpassed over 2 billion mobile downloads worldwide.[15][16][17] Morning Consult ranked TikTok as the third fastest growing brand of 2020, after only Zoom and Peacock.[18]"

In [None]:
answer_question(question, context)

Query has 376 tokens.

Answer: "▁china"


In [None]:
question = "About how many people are using Tik Tok?"
answer_question(question, context)

Query has 377 tokens.

Answer: "▁2 ▁billion"


In [None]:
# A future goal is to fix the data preprocessing so that the model is trained to return no answer when the context does not contain one.
question = "What country did Instagram start in?"
answer_question(question, context)

Query has 373 tokens.

Answer: "▁china"
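
One possible shape for the no-answer preprocessing mentioned in the comment above: SQuAD v2.0 marks unanswerable questions with `is_impossible` set to true and an empty `answers` list, so the reader could keep them as unanswerable instead of training on `plausible_answers`. This is a sketch on a toy dictionary; `read_squad_v2` and the `answer_start: None` marker are hypothetical conventions, not part of the notebook.

```python
def read_squad_v2(squad_dict):
    """Flatten a SQuAD v2.0 dict, keeping unanswerable questions marked as such."""
    contexts, questions, answers = [], [], []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            for qa in passage['qas']:
                if qa.get('is_impossible', False):
                    # No gold span: use a marker instead of the plausible answer.
                    candidates = [{'text': '', 'answer_start': None}]
                else:
                    candidates = qa['answers']
                for answer in candidates:
                    contexts.append(passage['context'])
                    questions.append(qa['question'])
                    answers.append(answer)
    return contexts, questions, answers

toy = {'data': [{'paragraphs': [{
    'context': 'TikTok is owned by ByteDance.',
    'qas': [
        {'question': 'Who owns TikTok?', 'is_impossible': False,
         'answers': [{'text': 'ByteDance', 'answer_start': 19}]},
        {'question': 'What country did Instagram start in?', 'is_impossible': True,
         'answers': [], 'plausible_answers': [{'text': 'ByteDance', 'answer_start': 19}]},
    ],
}]}]}

contexts, questions, answers = read_squad_v2(toy)
print(answers)
```

A common convention is then to point both `start_positions` and `end_positions` at the `[CLS]` token (index 0) for these examples, so `add_end_idx` and `add_token_positions` would need a matching branch for `answer_start is None`.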


## Processing the 10-K Form - Item 7

In [None]:
file = '/content/drive/MyDrive/NLP_POC/bby-202110k.txt'

In [None]:
def extract1(txt, start, end):
  # Return lines start..end (1-indexed, inclusive) of the file at path `txt` as one string.
  assert start <= end, "Start should be less than or equal to end"
  lines = []
  # The `with` block closes the file, so no explicit close() is needed.
  with open(txt, 'r') as f:
    for i, line in enumerate(f, start=1):
      if i > end:
        break
      if i >= start:
        lines.append(line)
  return ''.join(lines)

In [None]:
start=[1135,1156,1161,1168,1172,1189,1209,1235,1292,1297,1307,1311,1318,1322,1326,1340,1352,1365,1592,1610,1626,1646,1659,1673,1677,1682,1687,1692,1697,1719,1723,1737,1748,1753,1757,1761,1766,1770,1778,1783,1785,1788,1831,1837,1867,1877,1964,1971,1985,1991,1994,2002,2048,2099,2115]
end = [1154,1158,1164,1168,1182,1207,1229,1284,1293,1298,1307,1314,1318,1322,1336,1350,1359,1408,1604,1624,1644,1652,1670,1673,1678,1683,1688,1694,1712,1721,1735,1745,1749,1753,1757,1762,1766,1770,1780,1783,1786,1825,1835,1865,1871,1888,1967,1977,1988,1992,1995,2042,2093,2113,2142]

In [None]:
paragraph = []
for i in range(len(start)): 
  paragraph.append(extract1(file, start[i],end[i]))

joined_paragraph = ''.join(paragraph)
print(joined_paragraph)
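
The loop above reopens the file once per range. As a sketch of an alternative, the same lines could be collected in a single pass, assuming the ranges do not overlap (which appears to hold for the `start` and `end` lists here). `extract_ranges` is a hypothetical helper, and the demo writes a small temporary file rather than reading the 10-K.

```python
import os
import tempfile

def extract_ranges(path, starts, ends):
    """Join the lines in each inclusive, 1-indexed [start, end] range, reading the file once."""
    ranges = sorted(zip(starts, ends))
    picked = []
    r = 0  # index of the range currently being matched
    with open(path, 'r') as f:
        for lineno, line in enumerate(f, start=1):
            # Skip ranges that ended before this line.
            while r < len(ranges) and ranges[r][1] < lineno:
                r += 1
            if r == len(ranges):
                break
            if ranges[r][0] <= lineno:
                picked.append(line)
    return ''.join(picked)

# Demo on a ten-line temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(''.join(f'line {n}\n' for n in range(1, 11)))
result = extract_ranges(tmp.name, [2, 6], [3, 6])
os.remove(tmp.name)
print(result)
```

With the notebook's variables, the call would be `extract_ranges(file, start, end)`.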

In [None]:
joined_paragraph = joined_paragraph.replace('\n', ' ')

# Proposed Idea for Better Results
* Break up Item 7 into sections based on the headings, then ask questions against each section.
* Think about a way to chunk the large sections.
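
A minimal sketch of the chunking idea, assuming overlapping windows of words and a question-answering callable that returns an `(answer, score)` pair. `chunk_words`, `best_answer`, and `qa_fn` are hypothetical names; the notebook's `answer_question` returns only a string, so it would need to return a score as well (for example, the sum of the best start and end logits). The toy scorer below stands in for the model.

```python
def chunk_words(text, window=300, stride=150):
    """Split text into overlapping windows of `window` words, advancing by `stride`."""
    words = text.split()
    if len(words) <= window:
        return [' '.join(words)]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(' '.join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return chunks

def best_answer(question, text, qa_fn, window=300, stride=150):
    """Run qa_fn(question, chunk) -> (answer, score) on every chunk; keep the top score."""
    results = [qa_fn(question, chunk) for chunk in chunk_words(text, window, stride)]
    return max(results, key=lambda pair: pair[1])

def toy_qa(question, chunk):
    # Stand-in for the model: score 1.0 if the chunk mentions 'stores', else 0.0.
    hit = 'stores' in chunk
    return ('stores' if hit else ''), float(hit)

text = ' '.join(['filler'] * 10 + ['stores'] + ['filler'] * 10)
print(len(chunk_words(text, window=8, stride=4)))
print(best_answer('What closed?', text, toy_qa, window=8, stride=4))
```

The default of 300 words is a guess meant to leave room under the 512-token limit, since one word can map to several tokens. The overlap reduces the chance that an answer is split across two chunks.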

In [None]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns them as a single string.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text, truncation=True,)

    # Report how long the input sequence is.
    # print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example through the model.
    outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                    return_dict=True) 

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    # ALBERT uses SentencePiece, which marks the start of each word with '▁'.
    # Start with the first token, dropping its word-boundary marker.
    answer = tokens[answer_start].lstrip('▁')

    # Append the remaining answer tokens.
    for i in range(answer_start + 1, answer_end + 1):

        # A leading '▁' starts a new word: add a space, then the token without the marker.
        if tokens[i].startswith('▁'):
            answer += ' ' + tokens[i][1:]

        # Otherwise the token continues the previous word, so append it directly.
        else:
            answer += tokens[i]
    return answer

In [None]:
context = joined_paragraph
questions = ['What type of company is this?', 'What changes have been made in response to Covid-19?',
             'How many stores will have closed by the end of the fiscal year 2021?', ]

In [None]:
overview = context[:2707]

In [None]:
answer_to_fix = answer_question(questions[0], overview)
print(answer_to_fix)

omnichannel retailer that makes it easy for our customers to feel at home. we sell a wide assortment of merchandise in the home, baby, beauty & wellness markets and operate under the names bed bath & beyond   bbb  , buybuy baby   baby  , and harmon, harmon face values, or face values  collectively,  harmon  . we also operate decorist, an online interior design platform that provides personalized home design services. in addition, we are a partner in a joint venture, which operates retail stores


In [None]:
answer_to_fix

'▁we have undertaken significant changes over the past year  including extensive changes to executive leadership '

In [None]:
x[0][500:]


tensor([-5.2800, -3.4725, -2.4860, -5.0513, -0.2213, -5.1217, -3.9617, -4.2264,
         2.3934,  1.9139, -4.7331, -1.5098], grad_fn=<SliceBackward0>)

In [None]:
for i in range(len(questions)):
  print("Question: " + questions[i] + "\nAnswer: " + answer_question(questions[i], overview))

Question: What type of company is this?
Answer: ▁omnichannel▁retailer▁that▁makes▁it▁easy▁for▁our▁customers▁to▁feel▁at▁home.▁we▁sell▁a▁wide▁assortment▁of▁merchandise▁in▁the▁home,▁baby,▁beauty▁&▁wellness▁markets▁and▁operate▁under▁the▁names▁bed▁bath▁&▁beyond▁("bbb"),▁buybuy▁baby▁("baby"),▁and▁harmon,▁harmon▁face▁values,▁or▁face▁values▁(collectively,▁"harmon").▁we▁also▁operate▁decorist,▁an▁online▁interior▁design▁platform▁that▁provides▁personalized▁home▁design▁services.▁in▁addition,▁we▁are▁a▁partner▁in▁a▁joint▁venture,▁which▁operates▁retail▁stores
Question: What changes have been made in response to Covid-19?
Answer: ▁we▁have▁undertaken▁significant▁changes▁over▁the▁past▁year,▁including▁extensive▁changes▁to▁executive▁leadership,
Question: How many stores will have closed by the end of the fiscal year 2021?
Answer: ▁one


In [None]:
tokens[1][0:1]

'▁'

In [None]:
# Scratch copy of the tokenization steps from `answer_question`, run at top level for inspection.
question, answer_text = questions[0], overview

# ======== Tokenize ========
# Apply the tokenizer to the input text, treating them as a text-pair.
input_ids = tokenizer.encode(question, answer_text, truncation=True)

# Report how long the input sequence is.
# print('Query has {:,} tokens.\n'.format(len(input_ids)))

# ======== Set Segment IDs ========
# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1

# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b


In [None]:
input_ids = tokenizer.encode(questions[0], overview, truncation=True,)