<a href="https://colab.research.google.com/github/MarisabelC/QA_Transformer/blob/main/Transformer_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Set Up**

Install the Hugging Face Transformers package.

Import PyTorch.


In [None]:
#url: https://huggingface.co/transformers/model_doc/bert.html#bertforquestionanswering

# Install the Hugging Face transformers library
!pip install transformers
from transformers import BertForQuestionAnswering, BertTokenizer


# PyTorch is a Python-based scientific computing package aimed at two audiences:
#   - users who want a replacement for NumPy that can use the power of GPUs
#   - users who want a deep learning research platform with maximum flexibility and speed
import torch

**Load BERT pretrained Model and Tokenizer**

**Encode the question and context**

SQuAD - Stanford Question Answering Dataset 

*Huggingface Pretrained Models*

https://huggingface.co/transformers/pretrained_models.html


In [None]:
#load fine-tuned Bert
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model.to('cuda')

#load tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

question_list = ['What percent of New York City’s population has the Coronavirus?','What kind of testing does New York city use?','Who is New York City’s top official for disease control?']

context ='One of every five New York City residents tested positive for antibodies to the coronavirus, according to preliminary results described by Gov. Andrew M. Cuomo on Thursday that suggested that the virus had spread far more widely than known. If the pattern holds, the results from random testing of 3,000 people raised the tantalizing prospect that many New Yorkers — as many as 2.7 million, the governor said — who never knew they had been infected had already encountered the virus, and survived. Mr. Cuomo also said that such wide infection might mean that the death rate was far lower than believed. While the reliability of some early antibody tests has been widely questioned, researchers in New York have worked in recent weeks to develop and validate their own antibody tests, with federal approval. State officials believe that accurate antibody testing is seen as a critical tool to help determine when and how to begin restarting the economy, and sending people back to work. "The testing also can tell you the infection rate in the population — where it\'s higher, where it\'s lower — to inform you on a reopening strategy," Mr. Cuomo said. "Then when you start reopening, you can watch that infection rate to see if it\'s going up and if it\'s going up, slow down." The testing in New York is among several efforts by public health officials around the country to determine how many people may have been already exposed to the virus, beyond those who have tested positive. He said that while concerns about some tests on the market were valid, the state\'s test was reliable enough to determine immunity — and, possibly, send people back to the office.'

# Apply the tokenizer to the input text, treating them as a text-pair.
input_ids = tokenizer.encode(question_list[0], context)

**Tokenizer's behavior**

Special Tokens:


*   [CLS] – The classifier token, used for sequence classification
*   [UNK] – The unknown token
*   [SEP] – The separator token
*   [PAD] – The token used for padding
*   [MASK] – The token used for masking values
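To make the role of these tokens concrete, here is a minimal sketch of how a question/context pair is laid out before it reaches BERT. It assumes a plain whitespace split in place of the real WordPiece tokenizer, and `build_pair` is a hypothetical helper written for illustration, not part of the transformers API.

```python
def build_pair(question, context, max_len=None):
    """Lay out a text pair as [CLS] question [SEP] context [SEP].

    Whitespace splitting stands in for WordPiece here; this is an
    assumption made only to keep the sketch dependency-free.
    """
    tokens = ['[CLS]'] + question.lower().split() + ['[SEP]']
    tokens += context.lower().split() + ['[SEP]']
    # Optionally pad to a fixed length with [PAD].
    if max_len is not None and len(tokens) < max_len:
        tokens += ['[PAD]'] * (max_len - len(tokens))
    return tokens


pair = build_pair('Who spoke ?', 'The governor spoke .', max_len=12)
print(pair)
```

The real tokenizer additionally maps out-of-vocabulary pieces to [UNK]; [MASK] is only used during pretraining and does not appear in question answering inputs.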



In [None]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token, token_id in zip(tokens, input_ids):
  print('{:<20} {:>6,}'.format(token, token_id))
  # After each [SEP] token, print a row of asterisks as a visual divider.
  if token_id == tokenizer.sep_token_id:
      print('*****************')

**Segment embeddings**

Before the word embeddings enter the BERT layers, a segment embedding is added to each token so that BERT can distinguish the question from the context.


In [None]:
# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of tokens in the question segment, including [CLS] and the first [SEP].
num_seg_question = sep_index + 1

# The number of tokens in the context segment, including the final [SEP].
num_seg_context = len(input_ids) - num_seg_question

# Construct the list of 0s for the question and 1s for the context.
segment_ids = [0]*num_seg_question + [1]*num_seg_context

# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)
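As a small worked example of the segment IDs, the sketch below applies the same split-at-first-[SEP] rule to a hand-written token list. The tokens are made up for illustration and `segment_ids_for` is a hypothetical helper name.

```python
def segment_ids_for(tokens, sep_token='[SEP]'):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    n_question = tokens.index(sep_token) + 1
    return [0] * n_question + [1] * (len(tokens) - n_question)


# Hand-written tokens (an assumption, not real tokenizer output).
toy_tokens = ['[CLS]', 'who', 'spoke', '?', '[SEP]',
              'the', 'governor', 'spoke', '.', '[SEP]']
toy_segments = segment_ids_for(toy_tokens)

for tok, seg in zip(toy_tokens, toy_segments):
    print('{:<10} {}'.format(tok, seg))
```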

**Feed BERT model**


In [None]:
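# The tokens representing our input text.
input_tensor = torch.tensor([input_ids]).to('cuda')
# e.g. ['[CLS]', 'where', 'did', 'obama', ..., '[SEP]', 'president', ...]

# The segment IDs that differentiate the question from the context.
segment_tensor = torch.tensor([segment_ids]).to('cuda')
# e.g. [0, 0, 0, ..., 1, 1, ...]

# Index the output instead of unpacking it: this works whether the model
# returns a plain tuple (older transformers) or a ModelOutput (newer versions).
outputs = model(input_tensor, token_type_ids=segment_tensor)
start_scores, end_scores = outputs[0], outputs[1]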

**Get answer**

In [None]:
# Find the tokens with the highest `start` and `end` scores.
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item()


# Start the answer with its first token.
answer = tokens[answer_start]

# Append the remaining answer tokens.
for i in range(answer_start + 1, answer_end + 1):

    # If it's a subword token (e.g. non ##vio ##lent), recombine it with the previous token.
    if tokens[i][0:2] == '##':
        answer += tokens[i][2:]
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + tokens[i]
print(question_list[0])
print('Answer: "' + answer.title() + '"')

What percent of New York City’s population has the Coronavirus?
Answer: "One Of Every Five"
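Taking the argmax of the start and end scores independently can produce an end index that falls before the start index. One common remedy, sketched here under the assumption that the scores are plain Python lists, is to search for the highest-scoring pair with start <= end. `best_span` and `max_answer_len` are hypothetical names, and the toy scores are invented for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximising start + end score with start <= end."""
    best, best_score = None, float('-inf')
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Toy scores (invented): the independent argmaxes give start=2, end=1.
toy_start = [0.1, 0.2, 5.0, 0.3, 0.1]
toy_end = [0.1, 4.0, 0.2, 3.5, 0.3]

naive = (max(range(len(toy_start)), key=toy_start.__getitem__),
         max(range(len(toy_end)), key=toy_end.__getitem__))
span, span_score = best_span(toy_start, toy_end)
print('naive:', naive, 'constrained:', span)
```

The double loop is quadratic in the sequence length, which is acceptable for inputs of a few hundred tokens.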


In [None]:
def get_answer(question, context):

  input_ids = tokenizer.encode(question, context)
  # Recompute the tokens for this input; the global `tokens` belongs to the
  # first question only, so reusing it would misalign the answer indices.
  tokens = tokenizer.convert_ids_to_tokens(input_ids)
  sep_index = input_ids.index(tokenizer.sep_token_id)

  num_seg_question = sep_index + 1
  num_seg_context = len(input_ids) - num_seg_question
  segment_ids = [0]*num_seg_question + [1]*num_seg_context

  input_tensor = torch.tensor([input_ids]).to('cuda')
  segment_tensor = torch.tensor([segment_ids]).to('cuda')
  # Index the output so this works for both tuple and ModelOutput return types.
  outputs = model(input_tensor, token_type_ids=segment_tensor)
  start_scores, end_scores = outputs[0], outputs[1]

  answer_start = torch.argmax(start_scores)
  answer_end = torch.argmax(end_scores)
  answer = tokens[answer_start]

  for i in range(answer_start + 1, answer_end + 1):
      if tokens[i][0:2] == '##':
          answer += tokens[i][2:]
      else:
          answer += ' ' + tokens[i]

  return answer.title()
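The subword recombination step can also be factored into a small standalone helper. This is a sketch only: `merge_wordpieces` is a hypothetical name, and it assumes the `##` continuation prefix used by BERT's WordPiece vocabulary.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens into a string, gluing '##' pieces to the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith('##') and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return ' '.join(words)


merged = merge_wordpieces(['non', '##vio', '##lent', 'movement'])
print(merged)
```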


In [None]:
for question in question_list:
  print(question)
  print('answer:',get_answer(question,context)+'\n')

In [None]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased',return_token_type_ids = True)
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

context = "The US has passed the peak on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month. The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world."
question = "What was President Donald Trump's prediction?"
encoding = tokenizer.encode_plus(question, context)

input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
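# Index the output so this works for both tuple and ModelOutput return types.
outputs = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
start_scores, end_scores = outputs[0], outputs[1]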

ans_tokens = input_ids[torch.argmax(start_scores) : torch.argmax(end_scores)+1]
answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens , skip_special_tokens=True)

all_tokens = tokenizer.convert_ids_to_tokens(input_ids)

print ("\nAnswer Tokens: ")
print (answer_tokens)

answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)

print ("\nFinal Answer : ")
print (answer_tokens_to_string)
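The start and end scores are raw logits, so they are not directly comparable across questions. One common way to turn them into a rough confidence, sketched here with invented toy logits and plain Python lists, is to apply a softmax to each and multiply the probabilities of the chosen start and end. This illustrates the idea and is not a claim about how any particular library computes its score.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits (invented for illustration).
toy_start_logits = [0.0, 2.0, 0.0]
toy_end_logits = [0.0, 0.0, 2.0]

p_start = softmax(toy_start_logits)
p_end = softmax(toy_end_logits)

s = max(range(len(p_start)), key=p_start.__getitem__)
e = max(range(len(p_end)), key=p_end.__getitem__)
span_prob = p_start[s] * p_end[e]
print('span', (s, e), 'probability', round(span_prob, 3))
```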

In [None]:
import urllib.request
from bs4 import BeautifulSoup

def get_context(url):
  html = urllib.request.urlopen(url).read()
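  # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
  soup = BeautifulSoup(html, 'html.parser')
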
  # kill all script and style elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  # get text
  text = soup.get_text(separator=' ')

  # break into lines and remove leading and trailing space on each
  lines = (line.strip() for line in text.splitlines())
  # break multi-headlines into a line each
  chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
  # drop blank lines
  context = '\n'.join(chunk for chunk in chunks if chunk)
  
  # print(context)
  return context
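The whitespace clean-up at the end of `get_context` is easier to follow on a small literal string. The sketch below repeats just that step with no network access or HTML parsing; `clean_text` is a hypothetical name and the sample text is invented.

```python
def clean_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)


raw = "  Headline one  Headline two \n\n   Body text here.  \n"
cleaned = clean_text(raw)
print(cleaned)
```

Each surviving chunk ends up on its own line, so two headlines that shared a line in the page text are separated.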

In [None]:
#https://huggingface.co/deepset/roberta-base-squad2

# Import from the top-level package; the submodule paths
# (transformers.modeling_auto, transformers.tokenization_auto) were removed in later releases.
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
nlp_qa = pipeline('question-answering', model=model, tokenizer=tokenizer)


def get_answer(questions, url=None, context=None):
  # If a URL is given, scrape the page text and use it as the context.
  if url is not None:
    context = get_context(url)
  for question in questions:
    answer_map = nlp_qa(context=context, question=question)
    print(question, ' answer: ', answer_map['answer'], '. score:', answer_map['score'])
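Because each pipeline result carries a `score` alongside the `answer`, low-confidence answers can be filtered out before they are shown. The sketch below assumes results shaped like the dictionaries printed above; `keep_confident`, the 0.5 threshold, and the toy values are all illustrative assumptions rather than real model output.

```python
def keep_confident(results, min_score=0.5):
    """Keep only the results whose score reaches the threshold."""
    return [r for r in results if r['score'] >= min_score]


# Toy results (invented; real scores depend on the model and the context).
toy_results = [
    {'question': 'Q1', 'answer': 'Oslo', 'score': 0.91},
    {'question': 'Q2', 'answer': 'the', 'score': 0.04},
]

confident = keep_confident(toy_results)
for r in confident:
    print(r['question'], '->', r['answer'])
```

A suitable threshold is task-dependent and would need to be tuned on held-out questions.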


In [None]:
#COVID 19
questions =['What percent of New York City’s population has the Coronavirus?','What kind of testing does New York city use?','Who is New York City’s top official for disease control?']
get_answer(questions, url = "https://www.nytimes.com/2020/04/23/nyregion/coronavirus-antibodies-test-ny.html")

In [None]:
# COVID 19
questions =['What percent of New York City’s population has the Coronavirus?','What kind of testing does New York city use?','Who is New York City’s top official for disease control?']
context = 'One of every five New York City residents tested positive for antibodies to the coronavirus, according to preliminary results described by Gov. Andrew M. Cuomo on Thursday that suggested that the virus had spread far more widely than known. If the pattern holds, the results from random testing of 3,000 people raised the tantalizing prospect that many New Yorkers — as many as 2.7 million, the governor said — who never knew they had been infected had already encountered the virus, and survived. Mr. Cuomo also said that such wide infection might mean that the death rate was far lower than believed. While the reliability of some early antibody tests has been widely questioned, researchers in New York have worked in recent weeks to develop and validate their own antibody tests, with federal approval. State officials believe that accurate antibody testing is seen as a critical tool to help determine when and how to begin restarting the economy, and sending people back to work. “The testing also can tell you the infection rate in the population — where it’s higher, where it’s lower — to inform you on a reopening strategy,” Mr. Cuomo said. “Then when you start reopening, you can watch that infection rate to see if it’s going up and if it’s going up, slow down.” The testing in New York is among several efforts by public health officials around the country to determine how many people may have been already exposed to the virus, beyond those who have tested positive. The results appear to conform with research from Northeastern University that indicated that the coronavirus was circulating by early February in the New York area and other major cities. In California, a study using antibody testing found rates of exposure as high as 4 percent in Santa Clara County — higher than those indicated by infection tests, though not nearly as high as found in New York. Public health officials recently disclosed that a woman in Santa Clara who died on Feb. 6 was infected with the virus. In New York City, about 21 percent tested positive for coronavirus antibodies during the state survey. The rate was about 17 percent on Long Island, nearly 12 percent in Westchester and Rockland Counties and less than 4 percent in the rest of the state. State researchers sampled blood from the approximately 3,000 people they had tested over two days, including about 1,300 in New York City, at grocery and big-box stores. The results were sent to the state’s Wadsworth facility in Albany, a respected public health lab. Dr. Howard A. Zucker, the state health commissioner, said the lab had set a high bar for determining positive results, that it had been given blanket approval to develop coronavirus tests by the Food and Drug Administration and that state officials discussed this particular antibody test with the agency. He said that while concerns about some tests on the market were valid, the state’s test was reliable enough to determine immunity — and, possibly, send people back to the office. “It is a way to say this person had the disease and they can go back into the work force,” Dr. Zucker said. “A strong test like we have can tell you that you have antibodies.” But he cautioned that the length of any such immunity remained unknown. “The amount of time, we need to see. We don’t know that yet,” he said, adding, “They will last a while.” Unlike so-called diagnostic tests, which determine whether someone is infected, often using nasal swabs, blood tests for Covid-19 antibodies are intended to reveal whether a person was previously exposed and has developed an immune response. Some tests also measure the amount of antibodies present. Hours before Mr. Cuomo’s presentation, a top health official in New York City expressed general skeptical about the utility of antibody tests — especially those on the private market — when it comes to questions of immunity and critical decisions over social distancing and reopening the economy. Dr. Demetre C. Daskalakis, the city’s top official for disease control, wrote in an email alert on Wednesday that such tests “may produce false negative or false positive results,” pointing to “significant voids” in using the science to pinpoint immunity. The alert, sent to medical providers and other subscribers, went on to warn that the consequences of relying on potentially false results may lead to “providing patients incorrect guidance on preventive interventions like physical distancing or protective equipment.”'
get_answer(questions, context = context)

In [None]:
#Obama 
questions=['Where did Obama receive the nobel peace prize?', 'How did Obama justify the wars against terror?','When is war justifiable according to Obama?']
get_answer(questions, url ='https://www.npr.org/templates/story/story.php?storyId=121276209')


In [None]:
#Obama
questions=['Where did Obama receive the nobel peace prize?', 'How did Obama justify the wars against terror?','When is war justifiable according to Obama?']
context = 'President Obama accepted the Nobel Peace Prize during a ceremony Thursday in Norway, acknowledging the paradox of receiving the award as the U.S. is embroiled in two wars, while maintaining that instruments of war have a role in preserving peace. In his acceptance speech, Obama told Nobel Committee members and guests in Oslo that achieving peace must begin with the recognition that the use of force is sometimes morally justified. "Make no mistake: Evil does exist in the world. A nonviolent movement could not have halted Hitler\'s armies. Negotiations cannot convince al-Qaida\'s leaders to lay down their arms," he told the crowd. It was just nine days ago that Obama announced he is sending an additional 30,000 U.S. troops to Afghanistan in an effort to step up training of Afghan security forces and root out insurgents operating on the border with Pakistan. "I understand why war is not popular, but I also know this: The belief that peace is desirable is rarely enough to achieve it," he said, urging support for NATO and saying peacekeeping responsibilities shouldn\'t be left to a few countries. The president said war is justified in cases of self-defense, when civilians are being slaughtered by their own government, or a civil war threatens to engulf an entire region. Accompanied by first lady Michelle Obama, the president struck a humble tone upon arriving in the Norwegian capital after a seven-hour flight from Washington, D.C. He acknowledged the controversy surrounding the Nobel Committee\'s decision to honor him less than a year into his presidency, saying he knew there were others more deserving of the honor. During his speech, Obama took note of the "giants of history" who have been honored with the Peace Prize, including humanitarian Albert Schweitzer, civil rights leader Martin Luther King Jr. and Red Cross founder Henry Dunant. "My accomplishments are slight" by comparison, he said. At a news conference with Norwegian Prime Minister Jens Stoltenberg earlier in the day, Obama vowed to use the award to advance his goals for peace.'
get_answer(questions, context= context)

In [None]:
import spacy  # needed for spacy.load() in testSVOs()
from nltk.stem.wordnet import WordNetLemmatizer
from spacy.lang.en import English


SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]

def getSubsFromConjunctions(subs):
    moreSubs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(moreSubs) > 0:
                moreSubs.extend(getSubsFromConjunctions(moreSubs))
    return moreSubs

def getObjsFromConjunctions(objs):
    moreObjs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(moreObjs) > 0:
                moreObjs.extend(getObjsFromConjunctions(moreObjs))
    return moreObjs

def getVerbsFromConjunctions(verbs):
    moreVerbs = []
    for verb in verbs:
        rightDeps = {tok.lower_ for tok in verb.rights}
        if "and" in rightDeps:
            moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
            if len(moreVerbs) > 0:
                moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
    return moreVerbs

def findSubs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        # "SUB" is not one of the dependency labels listed in SUBJECTS, so test membership instead.
        subs = [tok for tok in head.lefts if tok.dep_ in SUBJECTS]
        if len(subs) > 0:
            verbNegated = isNegated(head)
            subs.extend(getSubsFromConjunctions(subs))
            return subs, verbNegated
        elif head.head != head:
            return findSubs(head)
    elif head.pos_ == "NOUN":
        return [head], isNegated(tok)
    return [], False

def isNegated(tok):
    negations = {"no", "not", "n't", "never", "none"}
    for dep in list(tok.lefts) + list(tok.rights):
        if dep.lower_ in negations:
            return True
    return False

def findSVs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs

def getObjsFromPrepositions(deps):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and dep.dep_ == "prep":
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
    return objs

def getObjsFromAttrs(deps):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(getObjsFromPrepositions(rights))
                    if len(objs) > 0:
                        return v, objs
    return None, None

def getObjFromXComp(deps):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(getObjsFromPrepositions(rights))
            if len(objs) > 0:
                return v, objs
    return None, None

def getAllSubs(v):
    verbNegated = isNegated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(getSubsFromConjunctions(subs))
    else:
        foundSubs, verbNegated = findSubs(v)
        subs.extend(foundSubs)
    return subs, verbNegated

def getAllObjs(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    objs.extend(getObjsFromPrepositions(rights))

    #potentialNewVerb, potentialNewObjs = getObjsFromAttrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def findSVOs(tokens):
    # Collect the (subject, verb, object) triples found in the parsed tokens.
    svos = []

    print(tokens)

    for token in tokens:
      print(token,' :' ,token.pos_, token.dep_)

    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    print(verbs)
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjs(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
    return svos

def getAbuserOntoVictimSVOs(tokens):
    maleAbuser = {'he', 'boyfriend', 'bf', 'father', 'dad', 'husband', 'brother', 'man'}
    femaleAbuser = {'she', 'girlfriend', 'gf', 'mother', 'mom', 'wife', 'sister', 'woman'}
    neutralAbuser = {'pastor', 'abuser', 'offender', 'ex', 'x', 'lover', 'church', 'they'}
    victim = {'me', 'sister', 'brother', 'child', 'kid', 'baby', 'friend', 'her', 'him', 'man', 'woman'}

    svos = findSVOs(tokens)
    wnl = WordNetLemmatizer()
    passed = []
    for s, v, o in svos:
        s = wnl.lemmatize(s)
        v = "!" + wnl.lemmatize(v[1:], 'v') if v[0] == "!" else wnl.lemmatize(v, 'v')
        o = "!" + wnl.lemmatize(o[1:]) if o[0] == "!" else wnl.lemmatize(o)
        if s in maleAbuser.union(femaleAbuser).union(neutralAbuser) and o in victim:
            passed.append((s, v, o))
    return passed

def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])

def testSVOs():
    nlp = spacy.load("en_core_web_sm")

    # tok = nlp("making $12 an hour? where am i going to go? i have no other financial assistance available and he certainly won't provide support.")
    # svos = findSVOs(tok)
    # printDeps(tok)
    # assert set(svos) == {('i', '!have', 'assistance'), ('he', '!provide', 'support')}
    # print(svos)

    tok = nlp("Who is New York City top official for disease control ?")
    svos = findSVOs(tok)
    printDeps(tok)
    # No expected triples are asserted for this question; just inspect the output.
    print(svos)

    print("-----------------------------------------------")
    tok = nlp("They ate the pizza with anchovies.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('they', 'ate', 'pizza')}

    print("--------------------------------------------------")
    tok = nlp("I have no other financial assistance available and he certainly won't provide support.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('i', '!have', 'assistance'), ('he', '!provide', 'support')}

    print("--------------------------------------------------")
    tok = nlp("I have no other financial assistance available, and he certainly won't provide support.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('i', '!have', 'assistance'), ('he', '!provide', 'support')}

    print("--------------------------------------------------")
    tok = nlp("he did not kill me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    # assert set(svos) == {('he', '!kill', 'me')}

    #print("--------------------------------------------------")
    #tok = nlp("he is an evil man that hurt my child and sister")
    #svos = findSVOs(tok)
    #printDeps(tok)
    #print(svos)
    #assert set(svos) == {('he', 'hurt', 'child'), ('he', 'hurt', 'sister'), ('man', 'hurt', 'child'), ('man', 'hurt', 'sister')}

    print("--------------------------------------------------")
    tok = nlp("he told me i would die alone with nothing but my career someday")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'told', 'me')}

    print("--------------------------------------------------")
    tok = nlp("I wanted to kill him with a hammer.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('i', 'kill', 'him')}

    print("--------------------------------------------------")
    tok = nlp("because he hit me and also made me so angry i wanted to kill him with a hammer.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'hit', 'me'), ('i', 'kill', 'him')}

    print("--------------------------------------------------")
    tok = nlp("he and his brother shot me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'shot', 'me'), ('brother', 'shot', 'me')}

    print("--------------------------------------------------")
    tok = nlp("he and his brother shot me and my sister")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'shot', 'me'), ('he', 'shot', 'sister'), ('brother', 'shot', 'me'), ('brother', 'shot', 'sister')}

    print("--------------------------------------------------")
    tok = nlp("the annoying person that was my boyfriend hit me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('person', 'was', 'boyfriend'), ('person', 'hit', 'me')}

    print("--------------------------------------------------")
    tok = nlp("the boy raced the girl who had a hat that had spots.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('boy', 'raced', 'girl'), ('who', 'had', 'hat'), ('hat', 'had', 'spots')}

    print("--------------------------------------------------")
    tok = nlp("he spit on me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'spit', 'me')}

    print("--------------------------------------------------")
    tok = nlp("he didn't spit on me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', '!spit', 'me')}

    print("--------------------------------------------------")
    tok = nlp("the boy raced the girl who had a hat that didn't have spots.")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('boy', 'raced', 'girl'), ('who', 'had', 'hat'), ('hat', '!have', 'spots')}

    print("--------------------------------------------------")
    tok = nlp("he is a nice man that didn't hurt my child and sister")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', 'is', 'man'), ('man', '!hurt', 'child'), ('man', '!hurt', 'sister')}

    print("--------------------------------------------------")
    tok = nlp("he didn't spit on me and my child")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    assert set(svos) == {('he', '!spit', 'me'), ('he', '!spit', 'child')}

    print("--------------------------------------------------")
    tok = nlp("he beat and hurt me")
    svos = findSVOs(tok)
    printDeps(tok)
    print(svos)
    # tok = nlp("he beat and hurt me")

def main():
    testSVOs()

if __name__ == "__main__":
    main()

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

# object and subject constants
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd"}
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "agent", "expl"}
# part-of-speech tags that mark a word as a wh- word
WH_WORDS = {"WP", "WP$", "WRB"}

# extract the subject, object and verb from the input
def extract_svo(doc):
    sub = []
    at = []
    ve = []
    for token in doc:
        # is this a verb?
        if token.pos_ == "VERB":
            ve.append(token.text)
        # is this the object?
        if token.dep_ in OBJECT_DEPS or token.head.dep_ in OBJECT_DEPS:
            at.append(token.text)
        # is this the subject?
        if token.dep_ in SUBJECT_DEPS or token.head.dep_ in SUBJECT_DEPS:
            sub.append(token.text)
    return " ".join(sub).strip().lower(), " ".join(ve).strip().lower(), " ".join(at).strip().lower()

# whether the doc is a question, plus the wh- word if any
def is_question(doc):
    # is the first token a verb?
    if len(doc) > 0 and doc[0].pos_ == "VERB":
        return True, ""
    # go over all words
    for token in doc:
        # is it a wh- word?
        if token.tag_ in WH_WORDS:
            return True, token.text.lower()
    return False, ""

# read user input in a loop and print the extracted info (empty input exits)
while True:
    line = input("> ")
    if not line.strip():
        break
    doc = nlp(line)
    # print out the pos and deps
    for token in doc:
        print("Token {} POS: {}, dep: {}".format(token.text, token.pos_, token.dep_))

    # get the input information
    subject, verb, attribute = extract_svo(doc)
    question, wh_word = is_question(doc)
    print("svo: subject: {}, verb: {}, attribute: {}, question: {}, wh_word: {}".format(subject, verb, attribute, question, wh_word))

In [None]:
!pip install textacy
import spacy
import textacy


nlp = spacy.load("en_core_web_sm")
text = nlp('What kind of testing does New York City use?')

# subject_verb_object_triples returns a generator, so materialize it before printing
text_ext = list(textacy.extract.subject_verb_object_triples(text))

print('text:', text_ext)

In [None]:
import os

%cd /content
if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
  !wget -P /content/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
else:
  print("word2vec already downloaded")



/content
word2vec already downloaded


In [None]:
from gensim.models import KeyedVectors

EMBEDDING_FILE = 'GoogleNews-vectors-negative300.bin.gz'

word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)



In [None]:
import scipy
import numpy as np
from scipy import spatial

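# vocabulary as a set for fast membership tests
# (gensim 4 renamed index2word to index_to_key; fall back to the old name on gensim 3)
index2word_set = set(getattr(word2vec, "index_to_key", None) or word2vec.index2word)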

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

    



In [None]:
def matching(a, b):
    # cosine similarity between the averaged word vectors of the two strings;
    # reuses the module-level index2word_set instead of rebuilding it on every call
    s1_afv = avg_feature_vector(a, model=word2vec, num_features=300, index2word_set=index2word_set)
    s2_afv = avg_feature_vector(b, model=word2vec, num_features=300, index2word_set=index2word_set)
    sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
    return sim

In [None]:
print(matching('The United States Flag','what other kind'))



  


0.14754468202590942
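
The score above is the cosine similarity between the averaged word vectors of the two strings. A minimal sketch of that computation follows, using plain Python and a hypothetical three-dimensional toy vocabulary (`TOY_VECTORS`, `average_vector`, and `cosine_similarity` are illustrative names, not part of the notebook) so it runs without downloading word2vec. When a string contains no known words, its average is the zero vector and the similarity is undefined; the sketch returns NaN in that case, which is the situation `returnresult` guards against with `math.isnan`.

```python
import math

# Hypothetical 3-dimensional embeddings; the notebook uses 300-dimensional word2vec vectors.
TOY_VECTORS = {
    "new": [1.0, 0.0, 0.0],
    "york": [0.8, 0.2, 0.0],
    "city": [0.6, 0.4, 0.0],
    "virus": [0.0, 0.0, 1.0],
}

def average_vector(sentence, vectors, dim=3):
    """Average the vectors of the in-vocabulary words; all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for word in sentence.lower().split():
        if word in vectors:
            known += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    return [t / known for t in total] if known else total

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; NaN when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

new_york = average_vector("New York", TOY_VECTORS)
same_topic = cosine_similarity(new_york, average_vector("city", TOY_VECTORS))
other_topic = cosine_similarity(new_york, average_vector("virus", TOY_VECTORS))
unknown = cosine_similarity(new_york, average_vector("zebra", TOY_VECTORS))
print(same_topic, other_topic, unknown)
```

Averaging discards word order, so two strings with the same words in a different order get a similarity of 1.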


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
import spacy
import pandas as pd
import gensim.downloader as api

lemmatizer = WordNetLemmatizer()
nltk.download('punkt')
pd.set_option('display.max_colwidth', 200)
nlp = spacy.load("en_core_web_sm")
# merge noun chunks, entities and subtokens into single tokens
# (spaCy v3 adds built-in components by name; v2 needs create_pipe first)
for name in ("merge_noun_chunks", "merge_entities", "merge_subtokens"):
    if spacy.__version__.startswith("2."):
        nlp.add_pipe(nlp.create_pipe(name))
    else:
        nlp.add_pipe(name)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
import math

def returnresult(text, question, numofanswers):
    # returns up to `numofanswers` (sentence, score) pairs, best match first,
    # or ['there was no match'] when no sentence scores above 0.5
    question = question.lower()
    text = text.lower()

    # split the context into sentences with spaCy
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]

    # score every sentence against the question; drop NaN scores so they
    # cannot disturb the sort
    scores = {}
    for sentence in sentences:
        score = matching(question, sentence)
        if not math.isnan(score):
            scores[sentence] = score
    sort_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    if not sort_scores or sort_scores[0][1] <= 0.5:
        return ['there was no match']
    return sort_scores[:numofanswers]
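
The function above ranks the context sentences by their similarity to the question and rejects the result when the best score is 0.5 or lower. The sketch below isolates that rank-and-threshold step so it can be tried without word2vec or spaCy. It assumes a hypothetical scorer, `token_overlap` (Jaccard overlap of lowercase tokens), in place of `matching`, and `rank_sentences` is likewise an illustrative name. Unlike the notebook, it returns an empty list when nothing clears the threshold.

```python
def token_overlap(a, b):
    """Hypothetical stand-in for matching(): Jaccard overlap of lowercase tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def rank_sentences(question, sentences, top_n=3, threshold=0.5, scorer=token_overlap):
    """Return up to top_n (sentence, score) pairs, or [] if the best score
    is at or below the threshold."""
    ranked = sorted(((s, scorer(question, s)) for s in sentences),
                    key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] <= threshold:
        return []
    return ranked[:top_n]

sentences = ["the test is reliable", "the economy will reopen", "cats sleep"]
best = rank_sentences("is the test reliable", sentences)
no_match = rank_sentences("dogs bark", sentences)
print(best)
print(no_match)
```

Returning an empty list keeps the return type uniform, so callers can always unpack `(sentence, score)` pairs without checking for a sentinel string first.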
    

In [None]:
import json

%cd /content

question_list_file_path = "question_list.json"

with open(question_list_file_path, 'r',  encoding="utf8") as file:

    data = file.read()

question_list = json.loads(data)

correctAnswer = 0

for q in question_list:
    print("************************************************************")
    print("Testing ", q['name'])

    text_file_path =  q['file']
    with open(text_file_path, 'r', encoding="utf8") as file:
        text_data = file.read()

    res = returnresult(text_data,  q['question'], 3)
    print(q['question'])
    
    
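    for answer in res[:3]:
        # results are (sentence, score) pairs; the no-match marker is a plain string
        sentence = answer[0] if isinstance(answer, tuple) else answer
        if q['Answer'].lower() == sentence.lower():
            correctAnswer += 1
            print('correct answer')
            break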
            
    print('correct answer: ' ,q['Answer'].lower() )    
    print(res)
    print()

print('accuracy: ' , correctAnswer / len(question_list), ' correct answer: ' , correctAnswer)