<a href="https://colab.research.google.com/github/OmdenaAI/ISS/blob/main/Task_2/Open_Information_Extraction_With_Coreference_Resolution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install urllib3==1.25.10
!pip install neuralcoref --no-binary neuralcoref
!pip install -U spacy==2.1.0
!pip install allennlp==1.0.0 allennlp-models==1.0.0 
!python -m spacy download en
!pip install unidecode

In [36]:
import pandas as pd
from allennlp.predictors.predictor import Predictor
import allennlp_models.structured_prediction
import spacy
import ast
import re
import neuralcoref
from unidecode import unidecode
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [37]:
# Load the pretrained AllenNLP OpenIE model
predictor_openie = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz")

In [50]:
# Loading case dataset from drive
doc = pd.read_excel('/content/drive/My Drive/Omdena/Task_2_Data_format.xlsx')

In [51]:
doc

Unnamed: 0,Case,Background information,Service Requested,Current Situation/HomeStudy
0,A,Amira’s mother left her daughter at an orphana...,Amira would like to get in contact with her bi...,"Accordingly, she is asking for the time being ..."
1,B,The Social Worker who made the child welfare a...,The Youth Welfare Office of Potsdam has been a...,The family left Ms. Amani’s house few days ago...
2,C,I hope you all are well and sane in these chal...,Anas has an expired German passport but he cl...,Anas is a 16-year-old teenager who lives with ...
3,D,Mr. Ali and Mrs. Leila are Syrian nationals an...,To locate Mr. Ali’s daughter and ensure that M...,The father called informing that his former wi...
4,E,Carol and Fadi are in a long term foster place...,Children’s Services are seeking your cooperati...,Carol and Fadi are settled in their placement ...


In [52]:
# Load the English spaCy model and add neuralcoref (for coreference resolution) to the pipeline
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7f40ba47cc88>

The next cell does the following:
1. Cleans the data: removes \n and \t characters and transliterates non-ASCII characters to ASCII, since one document contained non-ASCII characters (lines 2 to 9).
2. Finds coreferences and resolves them (lines 10 to 21).
Note: I wrote custom code to resolve the coreferences because the built-in resolution was not optimal for nested coreferences (e.g. A and B).
3. For each sentence, extracts relations made up of one predicate and multiple arguments, and keeps the relation with the fewest unassigned ('O') tags (lines 22 to 47).
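
The replacement rule from step 2 can be sketched without spaCy or neuralcoref. This is a minimal illustration, not the notebook's code: `resolve_cluster`, the `(start, end)` span tuples, and the sample tokens are hypothetical stand-ins for neuralcoref clusters and spaCy tokens. A mention is replaced by the representative mention only when the two share no words, which is what keeps nested mentions such as "Carol" inside "Carol and Fadi" untouched.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) token offsets, end exclusive


def resolve_cluster(tokens: List[str], main: Span, mentions: List[Span]) -> List[str]:
    """Replace each mention with the main mention's text unless they share a word."""
    out = list(tokens)
    main_text = "".join(tokens[main[0]:main[1]]).strip()
    main_words = set(main_text.split(" "))
    for start, end in mentions:
        if (start, end) == main:
            continue  # never rewrite the representative mention itself
        mention_text = "".join(tokens[start:end]).strip()
        # Skip mentions that share a word with the main mention (nested case).
        if set(mention_text.split(" ")) & main_words:
            continue
        # Preserve the whitespace that followed the mention's last token.
        last = tokens[end - 1]
        trailing_ws = last[len(last.rstrip()):]
        out[start] = main_text + trailing_ws
        for i in range(start + 1, end):
            out[i] = ""
    return out


# Tokens carry their trailing whitespace, like spaCy's token.text_with_ws.
tokens = ["Carol ", "and ", "Fadi ", "are ", "settled", ". ", "They ", "like ", "school", "."]
resolved = "".join(resolve_cluster(tokens, main=(0, 3), mentions=[(0, 3), (6, 7)]))
print(resolved)
```

Here "They" is replaced by "Carol and Fadi", while a nested mention such as `(0, 1)` ("Carol") would be left unchanged because it shares a word with the representative mention.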

In [56]:
doc_list = list()
for idx, row in doc.iterrows():
  cols = doc.columns.difference(['Case'])
  for col in cols:
    case_info = row[col]
    case_info = re.sub('[\n\t]+',' ',case_info) #removes \n and \t tags
    case_info = unidecode(case_info) #decodes non ASCII characters
    case_doc = nlp(case_info)
    clusters = case_doc._.coref_clusters #finds coreference clusters
    tok_list = list(token.text_with_ws for token in case_doc) #fetches tokens with whitespaces from spacy document
    for cluster in clusters:
      cluster_main_words = set(cluster.main.text.split(' ')) #get tokens from representative cluster name
      for coref in cluster:
        if coref!=cluster.main: #if coreference element is not the representative element of that cluster
            if coref.text!=cluster.main.text and not set(coref.text.split(' ')).intersection(cluster_main_words):
              # Replace the mention only if its text differs from the representative mention and the two share no words
              # This handles nested coreference scenarios (e.g. "A" inside "A and B")
                tok_list[coref.start] = cluster.main.text + case_doc[coref.end-1].whitespace_
                for i in range(coref.start+1, coref.end):
                    tok_list[i] = ""
    case_doc_text = ''.join(tok_list)
    case_doc_coref = nlp(case_doc_text)
    ie_output = list()
    for sentence in case_doc_coref.sents:
      ie_output.append(predictor_openie.predict(sentence=sentence.text)) #Find entity relations for each sentence
    sent_num = len(ie_output)
    for i in range(sent_num):
      min_O_count = 1000
      min_O_idx = 0
      for j in range(len(ie_output[i]['verbs'])):   #Find the relation with the fewest 'O' tags (i.e. unassigned tokens)
        O_count = ie_output[i]['verbs'][j]['tags'].count('O')
        if O_count < min_O_count:
            min_O_count = O_count
            min_O_idx = j
        else:
            continue
      print(row['Case'],col,i,min_O_idx)
      try:
        phrase_list = re.findall(r'\[.*?\]',ie_output[i]['verbs'][min_O_idx]['description']) #Store the description of the relation with the least unassigned tags
      except IndexError: #No relation was extracted for this sentence
        continue
      phrases_dict = dict()
      for phrase in phrase_list:
        phrase = phrase.strip('[]').split(": ", 1) #Split on the first ": " only, so argument text containing ": " is kept whole
        phrases_dict[phrase[0]] = phrase[1]
        phrases_dict['Case'] = row['Case']
        phrases_dict['Section'] = col
      doc_list.append(phrases_dict)

A Background information 0 0
A Background information 1 1
A Background information 2 0
A Background information 3 1
A Background information 4 0
A Background information 5 1
A Background information 6 0
A Background information 7 3
A Current Situation/HomeStudy 0 1
A Service Requested 0 0
A Service Requested 1 0
A Service Requested 2 1
B Background information 0 2
B Background information 1 2
B Background information 2 2
B Background information 3 1
B Background information 4 1
B Background information 5 1
B Background information 6 1
B Background information 7 0
B Background information 8 0
B Background information 9 0
B Current Situation/HomeStudy 0 0
B Current Situation/HomeStudy 1 1
B Current Situation/HomeStudy 2 0
B Current Situation/HomeStudy 3 2
B Current Situation/HomeStudy 4 0
B Current Situation/HomeStudy 5 0
B Current Situation/HomeStudy 6 4
B Current Situation/HomeStudy 7 0
B Current Situation/HomeStudy 8 0
B Service Requested 0 2
B Service Requested 1 2
B Service Requeste
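
The selection and parsing logic of the cell above can be tried in isolation on a hand-written dictionary. The sketch below assumes the predictor returns a `verbs` list whose entries carry `description` and `tags` fields, as the cell's code relies on. The `sample` dictionary is illustrative and not real model output, and `best_extraction` and `parse_description` are hypothetical helper names.

```python
import re
from typing import Dict, List, Optional


def best_extraction(verbs: List[dict]) -> Optional[dict]:
    """Return the extraction with the fewest 'O' tags (first one on ties), or None."""
    if not verbs:
        return None
    return min(verbs, key=lambda v: v["tags"].count("O"))


def parse_description(description: str) -> Dict[str, str]:
    """Turn '[ARG1: x] [V: y]' into {'ARG1': 'x', 'V': 'y'}."""
    return dict(re.findall(r"\[([^\]:]+): (.*?)\]", description))


# Hand-written sample for "The UNHCR was involved in this process ." (not real model output)
sample = {
    "verbs": [
        {
            "verb": "was",
            "description": "[ARG1: The UNHCR] [V: was] [ARG2: involved in this process] .",
            "tags": ["B-ARG1", "I-ARG1", "B-V", "B-ARG2", "I-ARG2", "I-ARG2", "I-ARG2", "O"],
        },
        {
            "verb": "involved",
            "description": "[ARG1: The UNHCR] was [V: involved] [ARG2: in this process] .",
            "tags": ["B-ARG1", "I-ARG1", "O", "B-V", "B-ARG2", "I-ARG2", "I-ARG2", "O"],
        },
    ]
}

best = best_extraction(sample["verbs"])
relation = parse_description(best["description"]) if best else {}
print(relation)
```

Using `min` with a key keeps the first extraction when several have the same number of 'O' tags, which matches the strict `<` comparison in the loop above. Returning `None` for an empty `verbs` list covers sentences for which no relation is found.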

In [54]:
df_docs = pd.DataFrame(doc_list) #Save the list of dictionaries as a dataframe

In [55]:
df_docs

Unnamed: 0,ARG0,Case,Section,V,ARG1,ARGM-LOC,ARG2,ARGM-TMP,ARGM-ADV,ARGM-CAU,ARGM-DIS,C-ARG1,ARGM-MOD,ARGM-DIR,ARGM-MNR,ARGM-NEG,ARG3,ARGM-PRP,ARGM-PRD,ARGM-COM,ARG4
0,Amira 's mother,A,Background information,left,Amira 's mother daughter,at an orphanage in Beirut,,,,,,,,,,,,,,,
1,,A,Background information,called,The children 's home,,Dar Al - Aytam Al - Islamiyya secure center fo...,,,,,,,,,,,,,,
2,her daughter,A,Background information,spent,,,living in children 's homes,The first nine years of her life,,,,,,,,,,,,,
3,,A,Background information,invited,her daughter,,to come to the Netherlands as a refugee,At the age of nine,,,,,,,,,,,,,
4,,A,Background information,was,The UNHCR,,involved in this process,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,Mr Jerry,E,Current Situation/HomeStudy,has,Mr Naji own place close to the region in Lebanon,,,,,,,,,,,,,,,,
115,Mr Jerry and Ms Maya,E,Current Situation/HomeStudy,informed,us,,"that Mr Naji , Mr Jerry and Ms Maya work in th...",Naji,,,,,,,,,,,,,
116,Mr Jerry and Ms Maya,E,Current Situation/HomeStudy,reported,"that Mr Naji is the Operations Director , Ms M...",,,Naji,,,,,,,,,,,,,
117,Children 's Services,E,Service Requested,seeking,your cooperation to visit the home of Mr. Naji...,,,,,,,,,,,,,,,,
