<a href="https://colab.research.google.com/github/tianjianjiang/imtku_for_dial_eval_1/blob/master/cudatype2_4_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check List

- [ ] Exercise 0: [Attribute](#Attribute)
- [ ] Exercise 1: [Always check the warnings](#scrollTo=ctVODCjcw0kf)
- [ ] Exercise 2: [Which version of TF?](#scrollTo=9UIufWQedr9R)
- [ ] Exercise 3: What are the other issues with this notebook?

# Attribute

What are the sources of the code snippets here?

Attribution is almost always the first rule of using open-source materials.

It is also a basic requirement of academic integrity.

Not to mention that it ultimately compensates for our own short memories.

# Prepare

## GPU

In [None]:
# Ensure GPU spec; T4 is preferred for colab and one can change it for another env.
NVIDIA_SMI_PATHS = !which nvidia-smi
GPU_LIST = []
if NVIDIA_SMI_PATHS:
  GPU_LIST = !nvidia-smi -L
if not GPU_LIST:
  print('On CPU because `nvidia-smi` is not found.')
elif not GPU_LIST[0].startswith('GPU 0: Tesla T4'):
  display(GPU_LIST)
  print('For Colab, if T4 is preferred, please Factory reset runtime until it is.')
else:
  display(GPU_LIST)
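
The cell above relies on IPython `!` magics, which only work inside a notebook. As a rough sketch of the same check in plain Python (the helper name `list_gpus` is hypothetical and not part of the original notebook), the standard library can be used instead:

```python
import shutil
import subprocess


def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] when it is unavailable."""
    # Hypothetical helper: mirrors `!which nvidia-smi` followed by `!nvidia-smi -L`.
    if shutil.which('nvidia-smi') is None:
        return []
    result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return result.stdout.splitlines()


gpus = list_gpus()
print(gpus if gpus else 'No GPU detected.')
```

This version should behave the same way in a script or a test runner, where the `!` syntax is a syntax error.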

## Dependencies

### Install

#### Defaults

Ensure there are no surprises from conflicting packages.

In [None]:
!pip3 install -U pip
!pip3 check

Capture logs to prevent them from consuming network bandwidth.

In [None]:
%%capture pip_logs
!pip3 install -U datascience albumentations coveralls

Verify the captured logs and then check the package dependencies again.

In [None]:
def verify(captured_logs):
  colab_vnd = 'application/vnd.colab-display-data+json'
  for o in captured_logs.outputs:
    if colab_vnd in o.data and 'pip_warning' in o.data[colab_vnd]:
      o.display()

In [None]:
verify(pip_logs)
!pip3 check

#### Which version of TF?

TensorFlow doesn't seem to be required by this notebook, and many models still use TF1.

If it is used somewhere, please be explicit about where.

Also, TF2 now ships CPU and GPU support in the same package, so there is no need to specify `tensorflow-gpu` anymore; `tensorflow` is enough.

**If the required version is TF1, `tensorflow-gpu` alone won't suffice; one must also specify a version number.**
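
One way to be explicit is to fail early when the installed major version is not the expected one. The sketch below is only an illustration: `major_version` and `require_major` are hypothetical helpers, and the version strings are made-up examples. On the installation side, the corresponding step would be a pin such as `tensorflow-gpu==1.15.*`, assuming that TF1 release is the one wanted.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string such as '2.3.0'."""
    return int(version.split('.')[0])


def require_major(version: str, expected_major: int) -> None:
    """Raise early when the installed major version is not the expected one."""
    found = major_version(version)
    if found != expected_major:
        raise RuntimeError(
            f'Expected major version {expected_major}, found {version}')


# In the notebook this would be called as require_major(tf.__version__, 2).
require_major('2.3.0', 2)
```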



In [None]:
%%capture pip_tensorflow_and_transformer_logs
!pip3 install -U tensorflow-gpu transformers

In [None]:
verify(pip_tensorflow_and_transformer_logs)
!pip3 check

In [None]:
import tensorflow as tf
tf.__version__

#### HuggingFace Transformers

In [None]:
%%capture pip_py_transformers_logs
!pip3 install pytorch-transformers

In [None]:
verify(pip_py_transformers_logs)
!pip3 check

### Import


In [None]:
import collections
from pathlib import Path

from google.colab import drive
import numpy as np  # used by the cells under "To-be-determined"
import pandas as pd
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import (DataLoader, Dataset, RandomSampler,
                              SequentialSampler, TensorDataset)

from pytorch_transformers import (WEIGHTS_NAME, BertConfig,
                                  BertForQuestionAnswering, BertTokenizer)

### Init

#### Paths

In [None]:
COLAB_CONTENT_DIR_P = Path('/content')
GD_DIR_P = COLAB_CONTENT_DIR_P / 'gdrive'
drive.mount(str(GD_DIR_P), force_remount=True)

In [None]:
PRJ_NAME = 'imtku_dial_eval_1'

BASE_DIR_P = GD_DIR_P / f'My Drive/{PRJ_NAME}'
BASE_DIR_P.mkdir(parents=True, exist_ok=True)
DATA_DIR_P = BASE_DIR_P / 'data'
DATA_DIR_P.mkdir(parents=True, exist_ok=True)

%cd "{DATA_DIR_P}"
%pwd

#### Files

This part imports the datasets and loads the data that is fed into the model for training. It will be deleted once finished.

@tianjianjiang: when and where did the deletion happen?

In [None]:
!rm -rf bert-chinese-qa*
!wget -q --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1GQtGFd-1AvZHZuYckhA3xqvvpDk-x5DW' -O bert-chinese-qa.zip
!unzip bert-chinese-qa.zip -d bert-chinese-qa

In [None]:
!git clone https://github.com/stg880631/BERT-Practice.git

In [None]:
%ls -R #https://github.com/stg880631/BERT-Practice.git

In [None]:
PdContestQuestion_A=pd.read_json(f'{DATA_DIR_P}/BERT-Practice/FGC_release_A.json')
PdContestQuestion_A.head(10)
#print(PdContestQuestion_A)

In [None]:
PdContestAnswers_A=pd.read_json(f'{DATA_DIR_P}/BERT-Practice/FGC_release_A_answers.json')
#print(PdContestAnswers_A)

#PdContestQuestion_B=pd.read_json('./BERT-Practice/FGC_release_B.json')
#print(PdContestQuestion_B)

#PdContestAnswers_B=pd.read_json('./BERT-Practice/FGC_release_B_answers.json')
#print(PdContestAnswers_B)

#PdTrainFirst=pd.read_json('./BERT-Practice/DRCD_dev.json')
#print(PdTrainFirst)

#PdTrainSecond=pd.read_json('./BERT-Practice/DRCD_test.json')
#print(PdTrainSecond)

#PdTrainThird=pd.read_json('./BERT-Practice/DRCD_training.json')
#print(PdTrainThird)

PdCSV=pd.read_csv(f'{DATA_DIR_P}/BERT-Practice/FGC_release_A_1.csv', encoding = 'big5')

#pd.get_dummies  # @tianjianjiang: what was the purpose of this?

# Define

## What are those functions for?

In [None]:
def to_list(tensor):
    return tensor.detach().cpu().tolist()
 

def _get_best_indexes(logits, n_best_size=1):
    """Get the n-best logits from a list."""
    index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True)

    best_indexes = []
    for i in range(len(index_and_score)):
        if i >= n_best_size:
            break
        best_indexes.append(index_and_score[i][0])
    return best_indexes
 

def evaluate(dataset, model, tokenizer):
    """Run the QA model over `dataset` and return the best start/end token indexes.

    Only the predictions of the last batch are returned, so this is meant for
    single-example datasets. It relies on the global `device`.
    """
    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=1)

    # Eval!
    model.eval()
    start_indexes, end_indexes = [], []
    for batch in eval_dataloader:
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {'input_ids':      batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      }
            outputs = model(**inputs)
            start_logits = to_list(outputs[0][0])
            end_logits   = to_list(outputs[1][0])
            start_indexes = _get_best_indexes(start_logits)
            end_indexes = _get_best_indexes(end_logits)
    return (start_indexes, end_indexes)
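
As a hint for the question in the heading, the n-best selection can be tried on its own. The sketch below is a compact equivalent written for illustration (the name `get_best_indexes` and the logits are made up); it is not the notebook's own function:

```python
def get_best_indexes(logits, n_best_size=1):
    """Return the positions of the `n_best_size` largest logits, best first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:n_best_size]]


# Made-up start logits for a four-token input.
start_logits = [0.1, 2.5, -1.0, 1.7]
print(get_best_indexes(start_logits))                 # [1]
print(get_best_indexes(start_logits, n_best_size=2))  # [1, 3]
```

With the default `n_best_size=1`, `evaluate` effectively takes the argmax of the start logits and the argmax of the end logits independently.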

In [None]:
def _check_is_max_context(doc_spans, cur_span_index, position):
    """Check if this is the 'max context' doc span for the token."""

    # Because of the sliding window approach taken to scoring documents, a single
    # token can appear in multiple documents. E.g.
    #  Doc: the man went to the store and bought a gallon of milk
    #  Span A: the man went to the
    #  Span B: to the store and bought
    #  Span C: and bought a gallon of
    #  ...
    #
    # Now the word 'bought' will have two scores from spans B and C. We only
    # want to consider the score with "maximum context", which we define as
    # the *minimum* of its left and right context (the *sum* of left and
    # right context will always be the same, of course).
    #
    # In the example the maximum context for 'bought' would be span C since
    # it has 1 left context and 3 right context, while span B has 4 left context
    # and 0 right context.
    best_score = None
    best_span_index = None
    for (span_index, doc_span) in enumerate(doc_spans):
        end = doc_span.start + doc_span.length - 1
        if position < doc_span.start:
            continue
        if position > end:
            continue
        num_left_context = position - doc_span.start
        num_right_context = end - position
        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
        if best_score is None or score > best_score:
            best_score = score
            best_span_index = span_index

    return cur_span_index == best_span_index


def convert_examples_to_features(tokenizer, question_text, doc_tokens, max_seq_length=384,
                                 doc_stride=1, max_query_length=35,
                                 cls_token_at_end=False,
                                 cls_token='[CLS]', sep_token='[SEP]', pad_token=0,
                                 sequence_a_segment_id=0, sequence_b_segment_id=1,
                                
                                 cls_token_segment_id=0, pad_token_segment_id=0,
                                 mask_padding_with_zero=True):
    """Convert one question/document pair into a one-example `TensorDataset`.

    Returns `(data, tokens)`. If the document needs several doc spans, only
    the last span is kept.
    """
    query_tokens = tokenizer.tokenize(question_text)
    if len(query_tokens) > max_query_length:
        query_tokens = query_tokens[0:max_query_length]
    tok_to_orig_index = []
    orig_to_tok_index = []
    all_doc_tokens = []
    for (i, token) in enumerate(doc_tokens):
        orig_to_tok_index.append(len(all_doc_tokens))
        sub_tokens = tokenizer.tokenize(token)
        for sub_token in sub_tokens:
            tok_to_orig_index.append(i)
            all_doc_tokens.append(sub_token)#turn DTEXT into tokens
        #print(sub_tokens)#(test)

    # The -3 accounts for [CLS], [SEP] and [SEP]
    max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
    # We can have documents that are longer than the maximum sequence length.
    # To deal with this we do a sliding window approach, where we take chunks
    # of up to our max length with a stride of `doc_stride`.
    _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
        "DocSpan", ["start", "length"])
    doc_spans = []
    start_offset = 0
    while start_offset < len(all_doc_tokens):
        length = len(all_doc_tokens) - start_offset
        if length > max_tokens_for_doc:
            length = max_tokens_for_doc
        doc_spans.append(_DocSpan(start=start_offset, length=length))
        if start_offset + length == len(all_doc_tokens):
            break
        start_offset += min(length, doc_stride)

    #input_ids = torch.tensor([0], dtype=torch.long)
    #input_mask = torch.tensor([0], dtype=torch.long)
    #segment_ids = torch.tensor([0], dtype=torch.long)
    #cls_index = torch.tensor([0], dtype=torch.long)
    #p_mask = torch.tensor([0], dtype=torch.float)
    #example_index = torch.arange(input_ids.size(0), dtype=torch.long)
    #tokens = []
    for (doc_span_index, doc_span) in enumerate(doc_spans):
        tokens = []
        token_to_orig_map = {}
        token_is_max_context = {}
        segment_ids = []

        # p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer)
        # The original TF implementation also keeps the classification token (set to 0) (not sure why...)
        p_mask = []

        # CLS token at the beginning
        if not cls_token_at_end:
            tokens.append(cls_token)
            segment_ids.append(cls_token_segment_id)
            p_mask.append(0)
            cls_index = 0

        # Query
        for token in query_tokens:
            tokens.append(token)
            segment_ids.append(sequence_a_segment_id)
            p_mask.append(1)

        # SEP token
        tokens.append(sep_token)
        segment_ids.append(sequence_a_segment_id)
        p_mask.append(1)

        # Paragraph
        for i in range(doc_span.length):
            split_token_index = doc_span.start + i
            #print(split_token_index)******
            token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

            is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                                   split_token_index)
            token_is_max_context[len(tokens)] = is_max_context#type:boolean

            tokens.append(all_doc_tokens[split_token_index])
            segment_ids.append(sequence_b_segment_id)
            p_mask.append(0)
        paragraph_len = doc_span.length
        #print(paragraph_len)(test)

        # SEP token
        tokens.append(sep_token)
        segment_ids.append(sequence_b_segment_id)
        p_mask.append(1)

        # CLS token at the end
        if cls_token_at_end:
            tokens.append(cls_token)
            segment_ids.append(cls_token_segment_id)
            p_mask.append(0)
            cls_index = len(tokens) - 1  # Index of classification token

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        #print(input_ids)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
        #print(input_mask)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(pad_token)
            input_mask.append(0 if mask_padding_with_zero else 1)
            segment_ids.append(pad_token_segment_id)
            p_mask.append(1)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
    input_ids = torch.tensor([input_ids], dtype=torch.long)
    input_mask = torch.tensor([input_mask], dtype=torch.long)
    segment_ids = torch.tensor([segment_ids], dtype=torch.long)
    cls_index = torch.tensor([cls_index], dtype=torch.long)
    p_mask = torch.tensor([p_mask], dtype=torch.float)
    example_index = torch.arange(input_ids.size(0), dtype=torch.long)
    data = TensorDataset(input_ids, input_mask, segment_ids,
                            example_index, cls_index, p_mask)


    ##print("*** Example ***")
    # print("doc_span_index: %s" % (doc_span_index))
    ##print("tokens: %s" % " ".join(tokens))
    # print("token_to_orig_map: %s" % " ".join([
    #                 "%d:%d" % (x, y) for (x, y) in token_to_orig_map.items()]))
    # print("token_is_max_context: %s" % " ".join([
    #                 "%d:%s" % (x, y) for (x, y) in token_is_max_context.items()
    #             ]))
    # print("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    # print("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    # print("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))

    return data, tokens
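
The sliding-window comments above can be checked on the sentence they use as an example. The following is a minimal, self-contained sketch: `make_doc_spans` and `best_span_index` are hypothetical names that restate the span loop and the "max context" scoring, and the window size of 5 with a stride of 3 is chosen only to reproduce spans A, B, and C from the comment.

```python
import collections

DocSpan = collections.namedtuple('DocSpan', ['start', 'length'])


def make_doc_spans(num_tokens, max_tokens_for_doc, doc_stride):
    """Split `num_tokens` positions into overlapping windows."""
    spans = []
    start = 0
    while start < num_tokens:
        length = min(num_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start=start, length=length))
        if start + length == num_tokens:
            break
        start += min(length, doc_stride)
    return spans


def best_span_index(doc_spans, position):
    """Index of the span that gives `position` the most balanced context."""
    best_score, best_index = None, None
    for index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if not span.start <= position <= end:
            continue
        left, right = position - span.start, end - position
        score = min(left, right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return best_index


doc = 'the man went to the store and bought a gallon of milk'.split()
spans = make_doc_spans(len(doc), max_tokens_for_doc=5, doc_stride=3)
print(spans)
print(best_span_index(spans, doc.index('bought')))  # 2, i.e. span C
```

Span B sees `bought` with 4 tokens of left context and 0 of right context, while span C sees it with 1 and 3, so span C wins on the minimum of the two.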

## DRCDDataSet class

In [None]:
# This part sets up the data for training the model; compare with point 2 of the LeeMeng article.
"""
Implement a Dataset that can read the training / test set; this is the part you need to understand thoroughly.
For each row of the tsv, this Dataset converts one pair of sentences into a BERT-compatible format and returns 3 tensors:
- tokens_tensor: the index sequence of the two sentences combined, including [CLS] and [SEP]
- segments_tensor: a binary tensor that marks the boundary between the two sentences
- label_tensor: a tensor of the label converted into indexes; None is returned for the test set
"""

class DRCDDataset(Dataset):
    # Read the preprocessed tsv file and initialize some parameters
    def __init__(self, mode, tokenizer):
        assert mode in ["train", "test"]  # normally you would also need a dev set for training
        self.mode = mode
        # For big data you would need iterator=True
        self.df = pd.read_csv(mode + ".tsv", sep="\t").fillna("")
        self.len = len(self.df)
        self.tokenizer = tokenizer  # we will use the BERT tokenizer
        #self.label_map={}
            #for i in len(df_train):
              #updata={answer:i}
              #self.label_map.update(updata)

    # Define the function that returns one training / test sample
    def __getitem__(self, idx):
        # Column order follows `train.tsv`: document, question, answer
        # (the same order is assumed for `test.tsv`)
        if self.mode == "test":
            document, question = self.df.iloc[idx, :2].values
            label_tensor = None
        else:
            document, question, answer = self.df.iloc[idx, :].values
            # Convert the label text into indexes as well so that it can become a tensor;
            # here the label is the sequence of token ids of the answer
            #label_tensor=tf.string_to_number(answer,out_type=None,name=None)
            #label_tensor=tf.convert_to_tensor(answer,dtype=None,dtype_hint=None,name=None)
            token_answer = self.tokenizer.tokenize(str(answer))
            answer_ids = self.tokenizer.convert_tokens_to_ids(token_answer)
            label_tensor = torch.tensor(answer_ids, dtype=torch.long)

        # Build the BERT tokens of the first sentence and append the separator [SEP]
        word_pieces = ["[CLS]"]
        tokens_question = self.tokenizer.tokenize(question)
        word_pieces += tokens_question + ["[SEP]"]
        len_a = len(word_pieces)

        # BERT tokens of the second sentence
        tokens_document = self.tokenizer.tokenize(document)
        word_pieces += tokens_document + ["[SEP]"]
        len_b = len(word_pieces) - len_a

        # Convert the whole token sequence into an index sequence
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids, dtype=torch.long)

        # Set the token positions of the first sentence, including [SEP], to 0;
        # the rest are 1 and mark the second sentence
        segments_tensor = torch.tensor([0] * len_a + [1] * len_b,
                                       dtype=torch.long)

        return (tokens_tensor, segments_tensor, label_tensor)

    def __len__(self):
        return self.len

## Batcher

In [None]:
# This part sets up the data for training the model; compare with point 2 of the LeeMeng article.

# The input `samples` of this function is a list, and each element is
# one sample returned by the `DRCDDataset` defined above; each sample contains 3 tensors:
# - tokens_tensor
# - segments_tensor
# - label_tensor
# It zero-pads the first two tensors and produces the masks_tensors described earlier
def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    segments_tensors = [s[1] for s in samples]

    # The training set has labels (answers)
    if samples[0][2] is not None:
        label_tensor = [s[2] for s in samples]
    else:
        label_tensor = None

    # Zero-pad to the same sequence length, then move to the GPU as int64
    tokens_index_tensors = pad_sequence(tokens_tensors,
                                        batch_first=True).long().cuda()
    segments_index_tensors = pad_sequence(segments_tensors,
                                          batch_first=True).long().cuda()

    #label_tensor= pad_sequence(label_tensor,
                                    #batch_first=True)

    # Attention masks: set the positions in tokens_index_tensors that are not
    # zero padding to 1 so that BERT attends only to the tokens at those positions
    masks_tensors = (tokens_index_tensors != 0).long()
    return tokens_index_tensors, segments_index_tensors, masks_tensors, label_tensor
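
The zero padding and the attention mask built by the batcher can be illustrated without PyTorch. This is only a sketch on plain lists: `pad_and_mask` is a hypothetical helper and the token ids are arbitrary example values.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and mark the real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    masks = [[int(token != pad_id) for token in row] for row in padded]
    return padded, masks


# Two sequences of different lengths (arbitrary example ids).
batch = [[101, 2769, 102], [101, 2769, 1962, 1391, 102]]
padded, masks = pad_and_mask(batch)
print(padded)  # [[101, 2769, 102, 0, 0], [101, 2769, 1962, 1391, 102]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is derived from the padded ids, which works only because the padding id 0 is never used for a real token.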

# Trial

In [None]:
# Because we are running Chinese QA today, only BERT can be used

# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cuda:0')
checkpoint = 'bert-chinese-qa'
config_class, model_class, tokenizer_class = BertConfig, BertForQuestionAnswering, BertTokenizer
model = model_class.from_pretrained(checkpoint).to(device)
tokenizer = tokenizer_class.from_pretrained('bert-base-chinese', do_lower_case=True)

In [None]:
# This part sets up the data for training the model; compare with point 1 of the LeeMeng article.

# df_train = pd.read_csv("DRCDtraining_output3.csv",encoding="MS950")
# @tianjianjiang: where was `DRCDtraining_output3.csv`?
# I guess there were some preprocessed files in the GDrive that weren't explicitly referenced in this notebook.
# So I use the original one instead.
df_train = pd.read_csv(
    f'{DATA_DIR_P}/BERT-Practice/DRCDtraining_output.csv',
    encoding='MS950',
    header=None,
    names=['document', 'question', 'answer'],
    error_bad_lines=False
  )

df_train.head(10)

In [None]:
df_train.columns

In [None]:
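# Use only a fraction of the training data to see how much BERT helps with few labeled samples
# (the original comment says 1%, but 1.00 keeps 100%; use 0.01 for 1%)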
SAMPLE_FRAC = 1.00
df_train = df_train.sample(frac=SAMPLE_FRAC, random_state=9527)

In [None]:
# I presume the snippet below was for the missing `DRCDtraining_output3.csv`
# Drop the unnecessary columns and rename the column headers
# df_train = df_train.reset_index()
# df_train = df_train.loc[:, ['document ', ' question ',' answer']]
# df_train.columns = ['document', 'question', 'answer']

In [None]:
# Drop overly long samples so that BERT's whole input sequence fits on a GPU with little memory
MAX_LENGTH = 256
MAX_LENGTHQUE = 220

df_train = df_train[~(df_train.document.apply(lambda x : len(x)) > MAX_LENGTH)]
df_train = df_train[~(df_train.question.apply(lambda x : len(x)) > MAX_LENGTHQUE)]
# For idempotence, save the processed result as a tsv for PyTorch to use
df_train.to_csv("train.tsv", sep="\t", index=False)

print("Number of training samples:", len(df_train))
df_train.head()

In [None]:
# Initialize a Dataset dedicated to reading the training samples, using the Chinese BERT tokenizer
trainset = DRCDDataset("train", tokenizer=tokenizer)

In [None]:
sample_idx = 0

# Take out the original text for comparison
document , question, answer = trainset.df.iloc[sample_idx].values

# Use the Dataset we just built to get the converted id tensors
tokens_tensor, segments_tensor, label_tensor = trainset[sample_idx]

# Convert tokens_tensor back into text
tokens = tokenizer.convert_ids_to_tokens(tokens_tensor.tolist())
combined_text = "".join(tokens)

# Show the difference before and after; it is just a plain print, so read the output directly
print(f"""[Original text]
Sentence 1: {document}
Sentence 2: {question}
Label     : {answer}

--------------------

[Tensors returned by the Dataset]
tokens_tensor  : {tokens_tensor}

segments_tensor: {segments_tensor}

label_tensor   : {label_tensor}

--------------------

[tokens_tensor converted back to text]
{combined_text}
""")

In [None]:
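# Initialize a DataLoader that returns `BATCH_SIZE` training samples at a time
# (the original comment said 64, but `BATCH_SIZE` is 1 here)
# Using `collate_fn` to merge a list of samples into one mini-batch is the key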
BATCH_SIZE = 1
trainloader = DataLoader(trainset, batch_size=BATCH_SIZE, 
                         collate_fn=create_mini_batch)

## Always check the warnings

1. Click "View runtime logs"
2. Read them carefully
3. Search for similar issues and solutions

In [None]:
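%%time
# This part actually trains the model; compare with point 4 of the LeeMeng article.
# Check the configuration
#model.config()

# Training mode
model.train()

# Use the Adam optimizer to update the parameters of the whole model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

EPOCHS = 6  # lucky number
for epoch in range(EPOCHS):

    running_loss = 0.0
    for data in trainloader:
        tokens_index_tensors, segments_index_tensors, \
            masks_tensors, label_tensor = data

        # `BertForQuestionAnswering` needs the token *positions* of the answer,
        # not its token ids. Assumption (with BATCH_SIZE = 1): take the first
        # place where the answer ids occur in the input ids.
        input_ids = tokens_index_tensors[0].tolist()
        answer_ids = [int(t) for t in label_tensor[0].tolist()]
        n = len(answer_ids)
        start = next((k for k in range(len(input_ids) - n + 1)
                      if input_ids[k:k + n] == answer_ids), None)
        if n == 0 or start is None:
            continue  # the answer is not found verbatim; skip this sample
        start_positions = torch.tensor([start], device=device)
        end_positions = torch.tensor([start + n - 1], device=device)

        # Reset the parameter gradients to zero
        optimizer.zero_grad()

        # forward pass
        outputs = model(input_ids=tokens_index_tensors,
                        token_type_ids=segments_index_tensors,
                        attention_mask=masks_tensors,
                        start_positions=start_positions,
                        end_positions=end_positions)

        loss = outputs[0]
        # backward
        loss.backward()
        optimizer.step()

        # Record the loss of the current batch
        running_loss += loss.item()

    # Compute the accuracy
    # `get_predictions` is not defined anywhere in this notebook, so only the loss is reported.
    # _, acc = get_predictions(model, trainloader, compute_acc=True)

    print('[epoch %d] loss: %.3f' % (epoch + 1, running_loss))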

# To-be-determined

In [None]:
def cutctextver2(datatext,dataquestion,plen,lookback,pstartlen,answerget,ans):
    """Slide a window of `plen` characters over `datatext[0]` in steps of `lookback`.

    The model is asked `dataquestion` on each window; once the text is exhausted,
    the question and the last answer found (or '[UNKNOWN]') are printed.
    """
    if pstartlen>=len(datatext[0]):
        print(dataquestion)
        print(ans if answerget else '[UNKNOWN]')
        return
    cutt=datatext[0][pstartlen:pstartlen+plen]
    #print(cutt)
    data, tokens = convert_examples_to_features(tokenizer=tokenizer, question_text=dataquestion, doc_tokens=cutt)
    start, end = evaluate(data, model, tokenizer)
    knowans="".join(tokens[start[0]: end[0]+1])
    # An empty span or one starting at [CLS] means "no answer in this window"
    if knowans!='' and knowans[0:5]!='[CLS]':
        answerget=True
        ans=knowans
    return cutctextver2(datatext,dataquestion,plen,lookback,pstartlen+lookback,answerget,ans)
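
The recursion above amounts to a sliding window over the text that keeps the last non-empty answer. A minimal iterative sketch of that idea follows; `iter_windows`, `last_answer`, and the toy `answer_fn` are hypothetical stand-ins, with `answer_fn` playing the role of the model call.

```python
def iter_windows(text, window_len, step):
    """Yield (start, chunk) pairs, moving the window start by `step` each time."""
    start = 0
    while start < len(text):
        yield start, text[start:start + window_len]
        start += step


def last_answer(text, window_len, step, answer_fn):
    """Return the answer from the last window that yields one, else '[UNKNOWN]'."""
    answer = None
    for _, chunk in iter_windows(text, window_len, step):
        candidate = answer_fn(chunk)  # stand-in for the model call
        if candidate:
            answer = candidate
    return answer if answer is not None else '[UNKNOWN]'


print(list(iter_windows('abcdefghij', window_len=4, step=3)))
# [(0, 'abcd'), (3, 'defg'), (6, 'ghij'), (9, 'j')]
print(last_answer('abcdefghij', 4, 3, lambda chunk: 'fg' if 'fg' in chunk else ''))
# fg
```

Because the step is smaller than the window length, consecutive windows overlap, which reduces the chance that an answer is cut in half at a window boundary.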

In [None]:
for i in range(0,len(PdContestQuestion_A)):
    context=np.array(PdContestQuestion_A[i:i+1]['DTEXT'])
    dfr=np.array(PdContestQuestion_A[i:i+1])
    print(dfr[0][0])
    print("Q"+str(i+1)+'.'+context[0])
    for j in range(len(dfr[0][2])):
        question=dfr[0][2][j]['QTEXT']
        print(question[3:len(question)])

In [None]:
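X = []
for i in range(0,len(PdContestQuestion_A)):
    context=np.array(PdContestQuestion_A[i:i+1]['DTEXT'])
    dfr=np.array(PdContestQuestion_A[i:i+1])
    for j in range(len(dfr[0][2])):
        question=dfr[0][2][j]['QTEXT']
        # Assumption: `questioncut` is the question without its 3-character prefix,
        # as printed in the previous cell; it was undefined in the original
        questioncut=question[3:]
        X.append([context,questioncut])
print(X[1][1])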

In [None]:
contexttest=np.array(PdContestQuestion_A[0:1]['DTEXT'])
dft=np.array(PdContestQuestion_A[0:1])
question=dft[0][2][1]['QTEXT']
print(dft[0][2][1]['QTYPE'])
ans=""
cutctextver2(contexttest,question,256,50,0,False,ans)

In [None]:
#testver3
ans=""
for i in range(0,len(PdContestQuestion_A)):
    context=np.array(PdContestQuestion_A[i:i+1]['DTEXT'])
    dfr=np.array(PdContestQuestion_A[i:i+1])  # `pddata` was undefined
    print(dfr[0][0])
    print("Q"+str(i+1)+'.'+context[0])
    for j in range(len(dfr[0][2])):
        print(dfr[0][2][j]['QTYPE'])
        question=dfr[0][2][j]['QTEXT']
        cutctextver2(context,question,256,50,0,False,ans)

In [None]:
ans=""
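# Assumption: the undefined `pddata` referred to `PdContestQuestion_A`
for i in range(0,len(PdContestQuestion_A)):
    context=np.array(PdContestQuestion_A[i:i+1]['DTEXT'])
    dfr=np.array(PdContestQuestion_A[i:i+1])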
    print(dfr[0][0])
    print("Q"+str(i+1)+'.'+context[0])
    for j in range(len(dfr[0][2])):
        question=dfr[0][2][j]['QTEXT']
        cutctextver2(context,question,256,50,0,False,ans)