# **Welcome to ABDICO_coreferences Code Notebook**

Resolving anaphora (pronouns and other expressions that refer back to something mentioned earlier) is often a necessary step before extracting the main entities and actions described in a document, because it makes each statement explicit about who or what it refers to. This notebook performs that resolution.

**Example**: After anaphora resolution, it is explicit that "them" in the policy refers to the podlings' websites.

      Before:
            Statement: "there are restrictions on where podlings can host their websites and what branding they can use on them."
            Attribute : "Podlings" (observing restrictions)
            Objects : "their websites", "them"

      After anaphora resolution:
            Statement: "there are restrictions on where podlings can host their websites and what branding podlings can use on their websites"
            Attribute : "Podlings" (observing restrictions)
            Objects : "their websites", "their websites"
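The substitution in this example can be mimicked with a few lines of Python. The sketch below is only a toy: the mention-to-antecedent mapping is hard-coded for this one sentence and `substitute_mentions` is a hypothetical helper, whereas the notebook itself relies on a trained coreference model to find such links.

```python
import re

def substitute_mentions(text, replacements):
    """Replace each listed mention (whole word, case-sensitive) with its antecedent."""
    for mention, antecedent in replacements.items():
        text = re.sub(rf"\b{re.escape(mention)}\b", antecedent, text)
    return text

statement = ("there are restrictions on where podlings can host their websites "
             "and what branding they can use on them.")

# Hypothetical resolution result for this single sentence (written by hand).
resolved = substitute_mentions(
    statement, {"they": "podlings", "them": "their websites"}
)
print(resolved)
```

Replacements are applied one after another, so this toy approach would misbehave if an antecedent itself contained a later mention. A real resolver works on token spans rather than on plain string matches.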

### **This notebook performs the following tasks**


*   Resolves anaphora in the documents of the uploaded dataset.
*   Outputs a downloadable file of the resolved documents, which can then be used in the ABDICO_parsing notebook to label data and to produce statistics and figures for reporting.

### NOTE:
- Ideally, limit each row of your data to 5-10 sentences.
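If your documents are longer than that, one possible way to prepare them is to cut each document into chunks of at most ten sentences before building the CSV. The sketch below assumes a naive punctuation-based sentence splitter; `split_sentences` and `chunk_document` are hypothetical helpers, not part of the NLP4GOV repository.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_document(text, max_sentences=10):
    """Group consecutive sentences into chunks of at most max_sentences."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# Synthetic 23-sentence document, used only to exercise the helpers.
long_text = " ".join(f"Sentence number {n}." for n in range(1, 24))
rows = chunk_document(long_text, max_sentences=10)
print(len(rows))  # 23 sentences give chunks of 10, 10 and 3 sentences
```

The regular expression will split incorrectly on abbreviations such as "e.g.", and a chunk boundary can separate a pronoun from its antecedent, so it is worth reviewing the chunks before uploading them.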


# **Installations and Setup**
This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.

Run the commands below to install the required dependencies. To run a cell, press ***Ctrl+Enter***.

In [1]:
!git clone https://github.com/BSAkash/NLP4GOV #cloning the github repo, you can visit this url for the entire directory
%cd NLP4GOV
!pip install -q -r ./constituent_coreferences/requirements.txt # installs all required dependencies in one go!
from constituent_coreferences import corefs
%cd /content/

Cloning into 'NLP4GOV'...
remote: Enumerating objects: 2138, done.
remote: Counting objects: 100% (294/294), done.
remote: Compressing objects: 100% (155/155), done.
remote: Total 2138 (delta 154), reused 275 (delta 137), pack-reused 1844
Receiving objects: 100% (2138/2138), 7.94 MiB | 20.79 MiB/s, done.
Resolving deltas: 100% (1253/1253), done.
/content/NLP4GOV

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


/content


# **Data upload**
Upload your dataset file here.

To use your own data, you will likely need to adapt it to the format this notebook expects.

The file must be in `.csv` format and named `main.csv` (the cell below renames the uploaded file to `main.csv`), and the column containing the policy documents must be named `document`.
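Before running the resolution step, it can help to confirm that the file really has a `document` column. The check below is a minimal sketch using only the standard library; `check_input_csv` is a hypothetical helper, and the in-memory sample stands in for `main.csv` so the example runs without an upload.

```python
import csv
import io

def check_input_csv(handle, required_column="document"):
    """Return the number of data rows, or raise ValueError if the column is missing."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or required_column not in reader.fieldnames:
        raise ValueError(
            f"expected a '{required_column}' column, found {reader.fieldnames}"
        )
    return sum(1 for _ in reader)

# In-memory stand-in for main.csv (made-up content).
sample = io.StringIO(
    'id,document\n'
    '1,"Podlings must follow branding rules. They may not use other marks."\n'
)
print(check_input_csv(sample))
```

For the real file, the same function could be called as `check_input_csv(open("main.csv", newline="", encoding="utf-8"))`, assuming the file is UTF-8 encoded.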


In [2]:
import os
from google.colab import files
uploaded = files.upload()
os.rename(list(uploaded.keys())[0], 'main.csv')

The next cell downloads our ASF policy dataset for demonstration purposes. You do not need to run it if you uploaded your own file in the cell above. To run it, first uncomment the code in the cell below (Ctrl+/).

In [3]:
# !wget -O main.csv https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv

--2024-03-21 07:26:04--  https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.193.207, 173.194.194.207, 173.194.195.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.193.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47082 (46K) [text/csv]
Saving to: ‘main.csv’


2024-03-21 07:26:05 (4.66 MB/s) - ‘main.csv’ saved [47082/47082]



# **Convert results to `.csv` and download**
*   Run this cell to resolve anaphora in the documents in `main.csv`.
*   The resolved sentences appear in the `raw institutional statement` column, one sentence per row.
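The one-sentence-per-row layout comes from pandas' `explode`, which the cell below applies to a column of sentence lists. The toy frame here illustrates the reshaping on its own; its contents are typed by hand and no coreference model is involved.

```python
import pandas as pd

# Toy frame: one document whose resolved text has already been split
# into a list of sentences (hand-written for illustration).
toy = pd.DataFrame({
    "document": ["Podlings host websites. They follow branding rules."],
    "raw institutional statement": [[
        "Podlings host websites.",
        "Podlings follow branding rules.",
    ]],
})

# explode() gives each list element its own row and repeats the other columns.
per_sentence = toy.explode("raw institutional statement").reset_index(drop=True)
print(per_sentence["raw institutional statement"].tolist())
```

Because the other columns are repeated, every output row still carries the original `document` text alongside its sentence.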




In [4]:
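import pandas as pd

instance = corefs.corefs()  # resolver imported in the setup cell
new_data = pd.read_csv("main.csv")
# Resolve anaphora in each document.
new_data['coref'] = new_data['document'].apply(lambda x : instance.coref(x))
# Split each resolved document into a list of sentences.
new_data['raw institutional statement'] = new_data['coref'].apply(lambda x : [sentence.text for sentence in instance.nlp(x).sentences])
# One row per sentence; the other columns are repeated.
new_data = new_data.explode(['raw institutional statement'])
new_data.to_csv("main.csv",index=False)  # overwrites main.csv with the results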

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| pos       | combined_charlm           |
| lemma     | combined_nocharlm         |
| depparse  | combined_charlm           |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
  num_effective_segments = (seq_lengths + self._max_length - 1) // self._max_length


In [5]:
files.download("main.csv")  #download file here
