### Question Answering with BERT

#### BERT QA setup

In [1]:
# https://medium.com/analytics-vidhya/question-answering-system-with-bert-ebe1130f8def#:~:text=Question%20Answering%20System%20using%20BERT&text=For%20the%20Question%20Answering%20System,embeddings%20and%20the%20segment%20embeddings.
# qa_bert https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad/tree/main

In [2]:
import torch

In [3]:
from transformers import BertForQuestionAnswering

In [4]:
from transformers import BertTokenizer

In [5]:
from transformers import pipeline

In [6]:
from pathlib import Path
import pandas as pd

#### load clean dataset

In [7]:
base_path = Path("/home/jupyter/deemed_consent")
data_dir = Path("data")
input_notes_path = base_path / data_dir / "clean_notes_context_7days.csv"
matrix_path = base_path / data_dir / "scenario_matrix.csv"

In [8]:
df = pd.read_csv(input_notes_path)

In [9]:
df.iloc[10]

project_id                                                               J5R2H
siebel_order_number                                        OR013-1220026349626
service_id                                                        ONEA74227797
delay_id                                                               1661781
reason_code                                                               2002
reason_text                                           Insufficient Information
event_type                                                         Delay Draft
delay_status                                                             Draft
delay_notes                  Reasonable assistance or information is requir...
actual_delay_start_date                                                    NaN
actual_delay_closure_date                                                  NaN
event_author                                                         614400623
event_timestamp                                 2023

#### pick observation index

In [10]:
n = 5

#### question & context

In [11]:
# question = "Why is there a delay?"
# question = "Why is there a delay to the provision of Ethernet services?"  # for delay_notes
question = "What is the most recent reason for a delay to the provision of Ethernet services?"  # for clean_llm_context

In [12]:
# context = df.delay_notes[n]
context = df.clean_llm_context[n]
print(context)

2023-11-15 - General
[PERSON]s ent to Bloomfield-Henderson,F,[PERSON] RHi [PERSON]I hope all is well.The Communications Provider has provided information to resolve delay below:Can you please book in and confirm fibre appointment and remove delay.Hi [PERSON],Can you use this code [PERSON] this doesnot required PO, Hope Appointment date confirmed for 21/11Many thanks,DivyaKind Regards [PERSON] Craig

2023-11-15 - General
SI Ref: C74755088 Case update CCT ID: ONEA11045549 - [CP] Vodafone Limited Case Update Good afternoon [PERSON] ,I hope all is well. I have received your query and have emailed the fibre coordinator of this order to please remove the delay and confirm an appointment for the fibre installation to go ahead.Please to allow three days for a response.Next update will be given by close of play: 20/11/23

2023-11-17 - General
[PERSON] received from <EMAIL> advising PO not required but to use code [PERSON] to get through the request. Done so and access has been submitted under C

#### model

In [13]:
# qa_bert: local directory holding the checkpoint linked in the setup cell
model_path = '/home/jupyter/projects/deemed-consent/qa_bert'

# Model
model = BertForQuestionAnswering.from_pretrained(model_path)

# Tokenizer
tokenizer = BertTokenizer.from_pretrained(model_path)

Some weights of the model checkpoint at /home/jupyter/projects/deemed-consent/qa_bert were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### answer

In [14]:
qa_model = pipeline("question-answering", model=model, tokenizer=tokenizer)
qa_ans = qa_model(question=question, context=context)

In [15]:
answer = qa_ans['answer']
print(answer)

Unexpected customer delay
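
The `question-answering` pipeline returns a dict with the keys `score`, `start`, `end` and `answer`, so the extracted span can be checked before it is passed on to the similarity step. The sketch below is a minimal illustration only: the helper name `summarise_qa_result`, the `min_score` threshold of 0.3 and the stand-in result are all assumptions, not values taken from this notebook.

```python
def summarise_qa_result(qa_result: dict, context: str, min_score: float = 0.3) -> dict:
    """Summarise one question-answering pipeline result.

    min_score is an illustrative threshold, not a tuned value.
    """
    start, end = qa_result["start"], qa_result["end"]
    return {
        "answer": qa_result["answer"],
        "score": round(qa_result["score"], 4),
        # The answer is an extracted span, so it should normally equal context[start:end].
        "span_matches_context": context[start:end] == qa_result["answer"],
        "confident": qa_result["score"] >= min_score,
    }


# Stand-in result with the same keys the pipeline returns (values are made up).
example_context = "The order is on hold. Unexpected customer delay was reported."
example_answer = "Unexpected customer delay"
example_start = example_context.index(example_answer)
example_result = {
    "score": 0.42,
    "start": example_start,
    "end": example_start + len(example_answer),
    "answer": example_answer,
}

summary = summarise_qa_result(example_result, example_context)
print(summary)
```

In the notebook this would be called as `summarise_qa_result(qa_ans, context)`; a low score is a hint that the extracted span may not be a reliable input for the scenario matching.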


#### reason code - spaCy similarity

In [16]:
import spacy
nlp = spacy.load("/opt/conda/lib/python3.10/site-packages/en_core_web_lg/en_core_web_lg-3.8.0")

In [17]:
scenario_df = pd.read_csv(
    matrix_path,
    na_filter=False,
    dtype={
        "scenario_id": "string",
        "scenario": "string",
        "dc_code": "string",
        "intention_template_row": "string",
        "type": "string",
    },
)

In [18]:
# Compare the BERT answer against every scenario description.
answer_doc = nlp(answer)
results = []
for _, row in scenario_df.iterrows():
    scenario_doc = nlp(row['scenario'])
    results.append({
        'scenario': row['scenario'],
        # Doc.similarity compares the two document vectors
        'similarity_score': scenario_doc.similarity(answer_doc),
        'dc_code': row['dc_code'],
    })
results_df = pd.DataFrame(results)
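
By default, spaCy's `Doc.similarity` is the cosine similarity of the two document vectors, and for a model with static word vectors such as `en_core_web_lg` a document vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python on made-up 3-dimensional token vectors; the helper names and the toy values are assumptions for illustration, not spaCy internals.

```python
import math


def mean_vector(token_vectors):
    """Average a list of equal-length token vectors component by component."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy token vectors (made up): the two "documents" share one token out of two.
answer_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scenario_tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

score = cosine_similarity(mean_vector(answer_tokens), mean_vector(scenario_tokens))
print(round(score, 3))
```

Because averaging smooths out individual words, unrelated sentences can still end up with fairly close scores, which is worth keeping in mind when reading a ranking whose top scores are tightly clustered.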

#### pick the most similar scenario & compare to the reason

In [22]:
results_df.sort_values(by='similarity_score', ascending=False).head(10)

     scenario                                                   similarity_score  dc_code
193  Special contractor requests required to progre...                  0.696666  4002, 2002
 61  Order is delayed awaiting newsites / BDUK infr...                   0.68812  9582
 57  The order requires downtime, but none supplied.                    0.681349  3010
 58  The planner issues a no survey required order ...                   0.68011  2002
159  Order is delayed awaiting spine planning                           0.677803  2042
 68  There is no power for the NTE/ customer power ...                  0.675258  9582
  0  The customer contact does not know how to arra...                  0.673727  2002
120  RO2 CCDs mis-aligned, circuit is delayed at ES...                  0.670542  2042
 21  Appointment cannot be booked as customer site ...                   0.66743  2012
 30  Modify order has missing details from previous...                  0.665771  none


In [20]:
df.reason_code[n]

9582

In [21]:
results_df.sort_values(by='similarity_score',ascending=False).head(1).dc_code == str(df.reason_code[n])

193    False
Name: dc_code, dtype: bool
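
The comparison above uses string equality and returns a one-row Series rather than a plain boolean. A `dc_code` that lists several codes, such as `"4002, 2002"`, can therefore never equal a single reason code, even when one of the listed codes is correct. A minimal sketch of a membership check follows; the helper names are hypothetical, and treating the literal `none` as "no code" is an assumption based on the value shown in the ranking output.

```python
def parse_dc_codes(dc_code) -> set:
    """Split a dc_code cell such as '4002, 2002' into a set of code strings."""
    parts = (part.strip() for part in str(dc_code).split(","))
    # Assumption: the literal 'none' marks a scenario with no code.
    return {part for part in parts if part and part.lower() != "none"}


def code_matches(dc_code, reason_code) -> bool:
    """True if reason_code is one of the codes listed in dc_code."""
    return str(reason_code).strip() in parse_dc_codes(dc_code)


print(code_matches("4002, 2002", 2002))
print(code_matches("4002, 2002", 9582))
```

In the notebook, the top-ranked scenario could then be checked with something like `code_matches(results_df.sort_values(by='similarity_score', ascending=False).iloc[0]['dc_code'], df.reason_code[n])`, which yields a single `True` or `False`.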

#### question: what percentage of observations do BERT QA and spaCy similarity get right, using either clean_llm_context or delay_notes?
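
One way to approach this is to collect, for each observation, the `dc_code` of the top-ranked scenario together with the recorded `reason_code`, and then compute the share of matches. The sketch below shows only the scoring step, under stated assumptions: `top1_accuracy` is a hypothetical helper, a prediction counts as correct if the reason code appears anywhere in a comma-separated `dc_code`, and the example pairs are made up rather than taken from the dataset.

```python
def top1_accuracy(pairs) -> float:
    """Percentage of (predicted_dc_code, reason_code) pairs that match.

    A pair matches when reason_code is one of the comma-separated
    codes in predicted_dc_code.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = 0
    for predicted_dc_code, reason_code in pairs:
        candidates = {code.strip() for code in str(predicted_dc_code).split(",")}
        if str(reason_code).strip() in candidates:
            hits += 1
    return 100.0 * hits / len(pairs)


# Made-up pairs for illustration: (dc_code of top-ranked scenario, true reason_code).
example_pairs = [
    ("4002, 2002", 9582),
    ("9582", 9582),
    ("2042", 2042),
    ("none", 2012),
]

accuracy = top1_accuracy(example_pairs)
print(f"{accuracy:.1f}% correct")
```

Running the full QA and similarity steps once with `clean_llm_context` and once with `delay_notes` as the context, and scoring each run this way, would give the two percentages to compare.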