# **Welcome to the ABDICO_parsing Code Notebook**

It is often useful to extract the main entities and actions described in a document. In the Institutional Grammar (IG), for example, this helps identify some of the most common parts of an institutional statement, such as the Attribute and Object, along with their Aim and Deontic.

This notebook uses new methods to automate this extraction, giving policy scholars an easy route to some of the most common IG elements. It is not restricted to IG: it can extract and relate these syntactic and semantic elements in any kind of text.

### **This notebook performs the following tasks**


*   It takes a corpus of policy sentences from any domain and extracts four central components of the Institutional Grammar from each: Attribute, Aim, Deontic, and Object.
  *   By default, it performs these tasks on a dataset we provide, but the intent is that users will upload their own datasets.
*   When given unlabeled data, it applies labels.
*   When given labeled (such as hand-coded) data, it applies labels and evaluates the accuracy of its labels against those provided.
*   It outputs a downloadable file of the original statements with labels, as well as statistics and figures that can be used in reporting.



### This notebook follows an approach to automatic coding with several features

Our approach is distinct from other manual and automatic approaches to IG coding:
* *It requires no training data*. You don't have to do any labelling. Labels may still be useful if you want a sense of how accurate our approach is on your data, but such validation requires coding only a sample.
* *It supports users of many experience levels*. To benefit from our approach, at minimum you have to be able to get your data into the right format (the right number of spreadsheet columns in the right order), upload it, run the code, and download the outputs. You don't need to know how to program, and you don't have to understand this notebook; you just have to be bold enough to face it. But because we expose the code, people who know how to code, or are trying to learn, can change and adapt it.
* *It uses both syntactic and semantic information*. This is important because not every grammatical subject of a sentence (a syntactic property) can be an Attribute: the subject has to be some kind of agent capable of action and decision making (a semantic property).
* *It is naturally robust to passive voice*. Passive voice increases the complexity of manual coding a bit, and of automatic coding a lot. Our approach works around this issue to identify Attributes and Objects even when they are grammatical objects and subjects, respectively.
* *It gets better as AI gets better*. This is because the code is written to permit the switching-in of different models as "back-ends" for the labelling.
* *With minor changes it can accommodate other languages*.  For the same reason as above, if a back-end model has been written for your language you can code in that language.  
* *We do not implement the full Institutional Grammar*. A drawback of our approach is that it extracts only the four most commonly used parts of IG. A related drawback is that there are minor technical differences between how we define those entities (such as agent) and how formal IG defines them. Our operationalization is close enough that it is sufficient to be clear in reporting about where your labels came from.
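To make the passive-voice point concrete, here is a small illustrative sketch. The sentences and their codings are hand-written, hypothetical examples (they are not output from this notebook), and the `CodedStatement` class is an assumption introduced only for illustration. It shows that an active sentence and its passive counterpart should receive the same four labels, even though the grammatical subject differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedStatement:
    """The four components this notebook extracts (hypothetical container)."""
    attribute: str  # the actor carrying out the action
    deontic: str    # e.g. "must", "may", "shall"
    aim: str        # the action itself
    obj: str        # what the action is applied to


# Hand-coded, hypothetical examples for illustration only.
active_sentence = "Members must submit reports."
active_coding = CodedStatement(
    attribute="members", deontic="must", aim="submit", obj="reports"
)

# In the passive version the grammatical subject is "reports",
# but the Attribute (the actor) is still "members".
passive_sentence = "Reports must be submitted by members."
passive_coding = CodedStatement(
    attribute="members", deontic="must", aim="submit", obj="reports"
)

print(active_coding == passive_coding)  # True
```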



# **Installation & Setup**
This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.

Run the commands below to install the necessary dependencies. To run a cell, press ***ctrl+enter***.

In [1]:
!git clone https://github.com/BSAkash/NLP4GOV
%cd NLP4GOV/src
!pip install -q -r ./SRL/requirements.txt
!pip install --upgrade huggingface-hub
!python -m spacy download en_core_web_sm
from SRL import SRL
%cd /content/

Cloning into 'NLP4GOV'...
remote: Enumerating objects: 2279, done.
remote: Counting objects: 100% (435/435), done.
remote: Compressing objects: 100% (204/204), done.
remote: Total 2279 (delta 243), reused 420 (delta 228), pack-reused 1844
Receiving objects: 100% (2279/2279), 8.91 MiB | 9.09 MiB/s, done.
Resolving deltas: 100% (1342/1342), done.
/content/NLP4GOV

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


/content


# **Data upload**
Upload your dataset file here. Alternatively, to see the notebook run on an example dataset and become familiar with the flow, download the provided example data file [ASF_policies_coref_resolved.csv](https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies_coref_resolved.csv).

For your own data, you will likely have to adapt it for this notebook to run. See below for the required format.

First run your own `.csv` data file through the `ABDICO_coreferences.ipynb` notebook. Download the coreference-resolved `main.csv` file from there and upload it here as `main.csv`. The column containing the policy statements must be named `raw institutional statement`.
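If your statements are not yet in a spreadsheet, a minimal sketch like the following could produce a `.csv` file with the expected column name. The helper `write_main_csv`, the example sentences, and the output file name `main_example.csv` are all assumptions made for illustration; the notebook itself does not define them, and whether additional columns are permitted is not stated here.

```python
import csv

# Column name this notebook expects for the policy text.
TEXT_COLUMN = "raw institutional statement"


def write_main_csv(statements, path="main.csv"):
    """Write one statement per row under the expected text column (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([TEXT_COLUMN])
        for statement in statements:
            writer.writerow([statement.strip()])
    return path


# Hypothetical statements, for illustration only.
example_statements = [
    "Members must submit a report to the board each quarter.",
    "The committee may approve new projects.",
]

# Written to a separate file name so an existing main.csv is not overwritten.
write_main_csv(example_statements, "main_example.csv")
```

The resulting file would still need to go through the coreference notebook, as described above, before being uploaded here.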

In [None]:
# Let the user upload a file from their computer.
# The policy statement column must be named "raw institutional statement".
import os
from google.colab import files

uploaded = files.upload()
# Rename the first uploaded file to main.csv, the name the later cells expect.
os.rename(list(uploaded.keys())[0], 'main.csv')

The cell below loads our ASF policy dataset for demonstration purposes. You do not need to run it if you uploaded your own file in the cell above. If you do want to run it, uncomment the code in the cell below using the (ctrl+/) keys.

In [2]:
# !wget -O main.csv https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies_coref_resolved.csv

--2024-04-30 07:27:21--  https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies_coref_resolved.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.200.207, 74.125.130.207, 74.125.68.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.200.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 209611 (205K) [text/csv]
Saving to: ‘main.csv’


2024-04-30 07:27:23 (396 KB/s) - ‘main.csv’ saved [209611/209611]




# **Applying labels**

* Run this cell to extract the ABDICO constituents from `main.csv`.
* The `.inference()` function below processes your uploaded dataset and adds labels based on the text column.


In [None]:
%cd /content
instance = SRL.SRL()
instance.inference('main.csv')

/content


# **Download labeled data**
* Run the cell below to download `main.csv`, which now contains the labeled data.

* The download is in `.csv` format and can be opened in any spreadsheet software.

* The outputs can be found in the columns "Attribute inf", "Object inf", "Deontic inf", and "Aim inf".
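If you would rather inspect the labels in code than in a spreadsheet, the sketch below reads the four output columns named above using only the Python standard library. The helper name `read_inferred_labels` is an assumption for illustration, and the sketch assumes the labeled file is a plain comma-separated file with a header row.

```python
import csv
import os

# Output column names as listed above.
INFERRED_COLUMNS = ["Attribute inf", "Object inf", "Deontic inf", "Aim inf"]


def read_inferred_labels(path):
    """Return one dict per row holding only the four inferred-label columns."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in INFERRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"Missing inferred columns: {missing}")
        return [{c: row[c] for c in INFERRED_COLUMNS} for row in reader]


# Print the first few labeled rows if a labeled main.csv is present.
if os.path.exists("main.csv"):
    try:
        for labels in read_inferred_labels("main.csv")[:5]:
            print(labels)
    except ValueError as err:
        print(err)
```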

In [None]:
files.download('main.csv')  # Download the file and view the results


# **Extra: Calculate accuracy of labels against ground-truth annotations**
If the data contains ground-truth (manually annotated) labels, this code compares them with the generated labels for evaluation.

The statistic we provide is a typical, useful summary statistic for classifiers of this type. Specifically, we provide F1, an alternative to accuracy that is suitable for evaluating a classifier when its labels are rare. Like accuracy, F1 ranges from 0 to 1. Read more about the F1 score [here](https://towardsdatascience.com/the-f1-score-bec2bbc38aa6).
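As a reminder of what F1 measures, here is a minimal sketch of the standard definition: the harmonic mean of precision and recall. The counts used are hypothetical, and this is not necessarily how `srl_eval` counts matches internally (for example, it may score at a different granularity); it only illustrates the formula.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Standard F1: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct labels, 2 spurious labels, 4 missed labels.
# Precision is 8/10 and recall is 8/12.
print(round(f1_score(8, 2, 4), 3))
```

Because F1 ignores true negatives, it is not inflated when most items correctly receive no label, which is why it suits rare labels better than plain accuracy.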

The `main.csv` file must contain columns for the ABDI ground-truth labels, named "Attribute", "Object", "Deontic", and "Aim" respectively.

See [here](https://github.com/BSAkash/NLP4GOV/tree/master/src/SRL/data) for examples of annotated datasets.
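Before running the evaluation, it may help to confirm that the ground-truth columns are actually present. The sketch below is one way to do that with the standard library; the helper name `missing_ground_truth_columns` is an assumption for illustration and is not part of the `SRL` module.

```python
import csv
import os

# Ground-truth column names required for evaluation, as listed above.
GROUND_TRUTH_COLUMNS = ["Attribute", "Object", "Deontic", "Aim"]


def missing_ground_truth_columns(path):
    """Return the required ground-truth columns absent from the CSV header."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [c for c in GROUND_TRUTH_COLUMNS if c not in header]


# Report any missing columns if main.csv is present.
if os.path.exists("main.csv"):
    missing = missing_ground_truth_columns("main.csv")
    if missing:
        print("Missing ground-truth columns:", missing)
    else:
        print("All ground-truth columns found.")
```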



In [3]:
!rm -rf /content/data
!mkdir -p /content/data/new_eval && mv main.csv /content/data/new_eval
evaluation = SRL.SRL("eval")
plt = evaluation.srl_eval('/content/data/')

new_eval Dataset:  264
Please check if file(s) have ground truth ABDICO labels


The cell below reproduces the performance reported [here](https://arxiv.org/pdf/2404.03206).

In [None]:
evaluation = SRL.SRL("eval")
plt = evaluation.srl_eval()

/content/NLP4GOV
Aquaculture_Policies Dataset:  156
Dataset after removing uncoded statements:  153
Abstractive coding in one or more fields: 152
 F1 score for attribute: 0.7603353601055807
 F1 score for object: 0.5665267342852559
 F1 score for deontic: 0.9280575539568345
 F1 score for aim: 0.8664335664335665
Food_policy_data Dataset:  496
Dataset after removing uncoded statements:  398
 F1 score for attribute: 0.7161072932477918
