## DistilBERT Question Answering - Pretrained on the Stanford Question Answering Dataset

For the initial BERT implementation we use DistilBERT together with the simpletransformers library, which provides a ready-made, BERT-based question answering implementation. The pretrained model chosen is distilbert-base-uncased-distilled-squad, which was trained on a large dataset from Stanford University aimed at QA problems.

In [1]:
import pandas as pd
import numpy as np
import json
import re

In [2]:
train = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/train.csv")
test = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/test.csv")
sample_submission = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/sample_submission.csv")

In [3]:
train.shape, test.shape

((27481, 4), (3534, 3))

In [4]:
#Reference https://www.kaggle.com/parulpandey/eda-and-preprocessing-for-bert

def clean(tweet):
    tweet = str(tweet)

    tweet = tweet.lower()

    # Remove HTML tags
    tweet = re.sub(r'<.*?>', '', tweet)

    # Remove text in square brackets
    tweet = re.sub(r'\[.*?\]', '', tweet)

    # Remove hyperlinks
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)

    return tweet
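
As a quick illustration of what `clean` does, the sketch below applies the same three substitutions to a made-up tweet (the sample string is hypothetical, not taken from the dataset). Note that the later cells of this notebook never call `clean`; they work on the stripped, lowercased text directly.

```python
import re

def clean(tweet):
    """Lowercase a tweet and drop HTML tags, [bracketed] text, and URLs."""
    tweet = str(tweet).lower()
    tweet = re.sub(r'<.*?>', '', tweet)
    tweet = re.sub(r'\[.*?\]', '', tweet)
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)
    return tweet

# Hypothetical sample tweet, for illustration only
sample = "Loving <b>this</b> [photo] see https://example.com/pic NOW"
cleaned = clean(sample)
print(repr(cleaned))
```

The removed pieces leave their surrounding spaces behind, so the result can contain doubled spaces; collapsing whitespace would be a separate step.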

In [5]:
train.dropna(inplace = True)
train["text"] = train["text"].apply(lambda x : x.strip())
train["selected_text"] = train["selected_text"].apply(lambda x : x.strip())
train.head()

index,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on th...","Sons of ****,",negative


In [6]:
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test=train_test_split(train[['text','textID','sentiment']],train['selected_text'],
                                               test_size=0.2,random_state=42,stratify=train['sentiment'])


X_train.reset_index(inplace=True,drop=True)
X_test.reset_index(inplace=True,drop=True)

Y_train=Y_train.reset_index(drop=True)
Y_test=Y_test.reset_index(drop=True)

print('X_train shape',X_train.shape,' Y_train shape ',Y_train.shape)
print('X_test shape',X_test.shape,' Y_test shape ',Y_test.shape)

X_train shape (21984, 3)  Y_train shape  (21984,)
X_test shape (5496, 3)  Y_test shape  (5496,)


In [7]:
X_train_Temp = X_train.copy()
X_train_Temp['selected_text'] = Y_train

In [8]:
X_test_Temp = X_test.copy()
X_test_Temp['selected_text'] = Y_test

In [9]:
X_train_Temp = X_train_Temp[['textID', 'text', 'selected_text', 'sentiment']]
X_train_Temp.head()

index,textID,text,selected_text,sentiment
0,ee181b36fe,Press `Ctrl` on bottom right. It`s there. KY,Press `Ctrl` on bottom right. It`s there. KY,neutral
1,989f65a4aa,ah remember the days when you`d sleep in until...,loser,negative
2,7669dc1086,i have a whole day planned for my mom today th...,will love!,positive
3,40198f86d1,I do that all the time,I do that all the time,neutral
4,836b055959,Twitter`s being lame and won`t post my twitpic...,lame,negative


In [10]:
X_test_Temp = X_test_Temp[['textID', 'text', 'selected_text', 'sentiment']]
X_test_Temp.head()

index,textID,text,selected_text,sentiment
0,45be0423e4,I thought that there was going to be another D...,crappy karaoke game. I miss the fighting,negative
1,521d5dd501,I bet you received lots of hit from that tweet...,I bet you received lots of hit from that tweet...,negative
2,605225ad21,Freakin` frustrated why can`t my coach realize...,frustrated,negative
3,0abe62c2ee,is feeling so bored... i miss school time,is feeling so bored..,negative
4,eca513ce47,wow this morning 8.15 hrs ding dong breakfasts...,"Mother hapy,",positive


In [11]:
train_array = np.array(X_train_Temp)
test_array = np.array(X_test_Temp)
use_cuda = True

In [12]:
# Find the start offset of each selected_text within its tweet
def start_index(text, selected_text):
    # str.find returns -1 when selected_text does not occur in text
    return text.lower().find(selected_text.lower())

# l[i] is the start offset for row i of train_array
l = [start_index(row[1], row[2]) for row in train_array]
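
`str.find` returns -1 when the substring is not found, and such an offset would end up in the training file as an invalid `answer_start`. The sketch below uses two rows from the `train.head()` output plus one made-up mismatch to show the offsets, and a hypothetical helper `find_missing` (not part of the notebook) that lists the rows whose answer cannot be located.

```python
def start_index(text, selected_text):
    """Return the character offset of selected_text in text, or -1 if absent."""
    return text.lower().find(selected_text.lower())

def find_missing(rows):
    """Hypothetical helper: indices of (text, selected_text) pairs with no match."""
    return [i for i, (text, sel) in enumerate(rows) if start_index(text, sel) == -1]

rows = [
    ("Sooo SAD I will miss you here in San Diego!!!", "Sooo SAD"),
    ("my boss is bullying me...", "bullying me"),
    ("what interview! leave me alone", "leave  me alone"),  # made-up: extra space
]
offsets = [start_index(t, s) for t, s in rows]
missing = find_missing(rows)
print(offsets, missing)
```

Running a check like this before writing the JSON file is one way to decide whether mismatched rows should be dropped or repaired.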

In [13]:
# question --> sentiment
# context  --> tweet text
# answer   --> selected text

def quesa_format_train(train):
    out = []
    for i, row in enumerate(train):
        qas = []
        ans = []
        question = row[-1]
        answer = row[2]
        context = row[1]
        qid = row[0]
        answer_start = l[i]  # start offset computed in the previous cell
        ans.append({"answer_start": answer_start, "text": answer.lower()})
        qas.append({"question": question, "id": qid, "is_impossible": False, "answers": ans})
        out.append({"context": context.lower(), "qas": qas})

    return out
        
    
train_json_format = quesa_format_train(train_array)
with open('train.json', 'w') as outfile:
    json.dump(train_json_format, outfile)
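
The records written to `train.json` follow a SQuAD-like layout: one context with a list of question/answer entries. Below is a minimal sketch of a single record, built from one row of the `train.head()` output. The helpers `make_record` and `answer_matches` are hypothetical names introduced here for illustration; the second one checks that the stored offset really points at the answer text.

```python
import json

def make_record(qid, context, question, answer):
    """Build one SQuAD-style record; the answer offset is found with str.find."""
    context, answer = context.lower(), answer.lower()
    answers = [{"answer_start": context.find(answer), "text": answer}]
    qas = [{"question": question, "id": qid, "is_impossible": False, "answers": answers}]
    return {"context": context, "qas": qas}

def answer_matches(record):
    """Hypothetical check: every answer_start must slice back to the answer text."""
    ctx = record["context"]
    return all(
        ctx[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for qa in record["qas"] for a in qa["answers"]
    )

record = make_record("088c60f138", "my boss is bullying me...", "negative", "bullying me")
print(json.dumps(record, indent=2))
print(answer_matches(record))
```

Here the sentiment label plays the role of the question, so the model is asked to extract the span of the tweet that supports that sentiment.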

In [14]:
# Same format as the training data, but with a placeholder answer

def quesa_format_test(test):
    out = []
    for row in test:
        qas = []
        ans = []
        question = row[-1]
        context = row[1]
        qid = row[0]
        # The true answer is not needed for prediction, so store a dummy one
        ans.append({"answer_start": 1000000, "text": "__None__"})
        qas.append({"question": question, "id": qid, "is_impossible": False, "answers": ans})
        out.append({"context": context.lower(), "qas": qas})
    return out
        
    
test_json_format = quesa_format_test(test_array)

with open('test.json', 'w') as outfile:
    json.dump(test_json_format, outfile)

In [15]:
!pip install '../input/simple-transformers-pypi/seqeval-0.0.12-py3-none-any.whl' -q
!pip install '../input/simple-transformers-pypi/simpletransformers-0.22.1-py3-none-any.whl' -q

In [16]:
from simpletransformers.question_answering import QuestionAnsweringModel

model_path = '/kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/'
model_path_ready = './model-distilbert'

# Create the model
model = QuestionAnsweringModel('distilbert', 
                               model_path, 
                               args={'reprocess_input_data': True,
                                     'overwrite_output_dir': True,
                                     'learning_rate': 5e-5,
                                     'num_train_epochs': 4,
                                     'max_seq_length': 128,
                                     'output_dir': model_path_ready,
                                     'doc_stride': 64,
                                     'fp16': False,
                                    },
                              use_cuda=use_cuda)

model.train_model('train.json')
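
As a rough sanity check on training length, the number of batches per epoch is the number of examples divided by the batch size, rounded up. The batch size is not set in `args` above, so the value of 8 used here is an assumption, as is the idea that each short tweet yields exactly one input feature at `max_seq_length=128`.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

BATCH_SIZE = 8        # assumption: not specified in the notebook's args
NUM_TRAIN = 21984     # rows in X_train after the split
NUM_VALID = 5496      # rows in X_test after the split
NUM_EPOCHS = 4        # 'num_train_epochs' in args

train_steps = steps_per_epoch(NUM_TRAIN, BATCH_SIZE)
valid_steps = steps_per_epoch(NUM_VALID, BATCH_SIZE)
total_steps = train_steps * NUM_EPOCHS
print(train_steps, valid_steps, total_steps)
```

Under these assumptions the sketch gives 2748 batches per training epoch, 687 batches for prediction on the held-out split, and 10992 training batches in total.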

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at /kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/ and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 21984/21984 [00:34<00:00, 639.52it/s]


Running loss: 3.677236



Running loss: 0.755826


Running loss: 0.511390


Running loss: 0.698674


Running loss: 0.183927



In [17]:
pred = model.predict(test_json_format)

100%|██████████| 5496/5496 [00:07<00:00, 777.11it/s]





In [18]:
df = pd.DataFrame.from_dict(pred)
df_final = X_test_Temp.copy()
df_final['pred'] =  df['answer']
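
The assignment above relies on `df` and `df_final` sharing the same row order and index. A more defensive variant, sketched here with plain dictionaries, looks predictions up by identifier instead. It assumes each prediction is a dict with `id` and `answer` keys: only `answer` is confirmed by the cell above, the `id` key is an assumption about the library's output, and the predicted values are made up. `attach_predictions` is a hypothetical helper, not part of the notebook.

```python
def attach_predictions(rows, preds):
    """Return copies of rows with a 'pred' field looked up by textID.

    rows:  list of dicts with at least a 'textID' key
    preds: list of dicts with 'id' and 'answer' keys (assumed format)
    Rows without a matching prediction get an empty string.
    """
    by_id = {p["id"]: p["answer"] for p in preds}
    return [{**row, "pred": by_id.get(row["textID"], "")} for row in rows]

rows = [
    {"textID": "605225ad21", "selected_text": "frustrated"},
    {"textID": "0abe62c2ee", "selected_text": "is feeling so bored.."},
]
# Made-up predictions, deliberately listed in a different order
preds = [
    {"id": "0abe62c2ee", "answer": "bored"},
    {"id": "605225ad21", "answer": "frustrated"},
]
joined = attach_predictions(rows, preds)
print(joined)
```

Keying on the identifier means a reordered or filtered prediction list cannot silently pair a prediction with the wrong tweet.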

In [19]:
def jaccard(str1, str2):
    # Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercase tokens
    a = set(str(str1).lower().split())
    b = set(str(str2).lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
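
The score is the word-level Jaccard index: the number of distinct lowercase words shared by both strings, divided by the number of distinct words in their union. A small worked example follows, using a reference span from the `train.head()` output and made-up predictions.

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(str(str1).lower().split())
    b = set(str(str2).lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

truth = "Sooo SAD"
exact = jaccard(truth, "sooo sad")                    # same words, different case
longer = jaccard(truth, "Sooo SAD I will miss you")   # 2 shared words, 6 in the union
disjoint = jaccard(truth, "miss you")                 # no shared words
print(exact, longer, disjoint)
```

The function raises `ZeroDivisionError` when both strings contain no words, so an empty prediction paired with an empty reference would need separate handling.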

In [20]:
def compute_jaccard(Y):
    all_jaccard = []
    for i in range(len(Y)):
        score = jaccard(Y.iloc[i]["selected_text"], Y.iloc[i]["pred"])
        all_jaccard.append(score)
    return np.mean(np.array(all_jaccard))

In [21]:
score_total = compute_jaccard(df_final)
score_total

0.6951614840442226