# **Welcome to the ABDICO clustering notebook**

* This notebook helps recognize dominant groups of related actors, objects, regulated actions ("aims") and modals ("deontics") in institutional statements.
Unlike traditional word-based topic models such as LDA, it uses BERTopic, which groups text by semantic similarity (word meaning).

### ***This notebook performs the following tasks***


* It takes a `main.csv` file with columns named "Attribute", "Object", "Deontic" and "Aim".
* The user indicates the institutional constituent (ABDICO) over which clustering is to be performed.
* The output `main.csv` contains a column named after the chosen constituent with the suffix "_group" (for example "Object_group"), indicating the topic cluster to which each constituent belongs.
* The topic of each group is indicated by the most representative words from its cluster.








# **Installations and Setup**
* This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.
* The commands below install the components that the rest of the analysis needs. To run a cell, press ***Ctrl+Enter***, or select ***Runtime*** from the menu above and then one of the ***Run*** options within it.

In [None]:
!pip install transformers
!pip install sentence-transformers
!pip install bertopic


# **Data upload**

Run this cell to upload your own data.

You will likely have to adapt your own data before this notebook can run on it. The required format is described below.

Please name uploaded files `main.csv` and `main2.csv`. For each institutional statement, there must be columns indicating the corresponding ABDICO codings, that is, "Attribute", "Object", "Deontic" and "Aim". These may be hand-coded or computationally extracted.

See ABDICO_parsing.ipynb in this repository for automated institutional grammar coding of policies.
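
Before clustering, it can help to confirm that an uploaded file has the columns listed above. The following is a minimal sketch using only the Python standard library. The helper name `missing_abdico_columns` and the sample rows are hypothetical and not part of this notebook.

```python
import csv
import io

# Column names as listed in this notebook; adjust if your file differs.
REQUIRED_COLUMNS = ["Attribute", "Object", "Deontic", "Aim"]


def missing_abdico_columns(csv_file):
    """Return the required ABDICO columns absent from the CSV header.

    `csv_file` is any open text file object (hypothetical helper).
    """
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Illustrative file content; the statement text is made up for this example.
sample = io.StringIO(
    "raw institutional statement,Attribute,Object,Deontic\n"
    "Mentors must review releases.,Mentors,releases,must\n"
)
print(missing_abdico_columns(sample))  # prints ['Aim']
```

To check a real upload, open it with `open("main.csv", newline="")` and pass the file object to the helper. An empty list means all four columns are present.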


In [None]:
from google.colab import files
uploaded = files.upload()

# **Or use our data archives**

You may also uncomment and run the cell below to follow the demonstration on archival data. This downloads our dataset for the Apache Software Foundation, an open source software community, directly into Colab.

* [ABDICO coded Apache community policies](https://storage.googleapis.com/cscw_2022/anamika_os.csv)

In [None]:
## Alternatively, use the !wget command below to download our dataset into the notebook.
## Uncomment the code below (select it and press Ctrl+/) before running this cell.

## ABDICO coded policy example
# !wget -O main.csv https://storage.googleapis.com/routines_semantic/ASF_ABDICO.csv
# import pandas as pd
# import nltk
# nltk.download('punkt')
# data = pd.read_csv('main.csv')
# data['raw institutional statement'] = data['raw institutional statement'].apply(lambda x : nltk.tokenize.sent_tokenize(x))
# data = data.explode('raw institutional statement')
# data.to_csv('main.csv',index=False)

--2023-10-19 02:18:09--  https://storage.googleapis.com/routines_semantic/ASF_ABDICO.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.199.207, 108.177.98.207, 74.125.197.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.199.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61905 (60K) [text/csv]
Saving to: ‘main.csv’


2023-10-19 02:18:09 (133 MB/s) - ‘main.csv’ saved [61905/61905]



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Clustering and topic modeling**

* Before uploading your own dataset, remove all rows with NaN (missing) text values from the CSV file.
* You may need to make some changes to the first three lines of the following cell:

  * **Component** : The institutional component of the policies/institutional statements that you would like to cluster and group. This must exactly match the name of the column that contains the ABDICO component of interest. It is case sensitive.
  * **top_n_words** : The N dominant words used to indicate/represent the topic of each group. We recommend setting this number to no more than 10.
  * **num_topic** : This one is a little tricky. This value guides the clustering algorithm to find the specified number of clusters and topics among the ABDICO components. You may use an estimate (e.g. if you are clustering attributes and are aware that there are around 5 different actors in your policy data), or set a slightly higher number, since similar clusters are easy to combine manually afterwards. Our default is 20, for our Apache policies.
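
The clean-up described in the first bullet can be done with pandas before uploading. This is a minimal sketch, assuming the component of interest is the "Object" column. The helper name `drop_empty_components` and the sample rows are hypothetical.

```python
import io

import pandas as pd


def drop_empty_components(df, component):
    """Drop rows whose `component` cell is missing or only whitespace."""
    cleaned = df.dropna(subset=[component])
    is_blank = cleaned[component].astype(str).str.strip() == ""
    return cleaned[~is_blank].reset_index(drop=True)


# Illustrative data; the values are made up for this example.
raw = io.StringIO(
    "Attribute,Object\n"
    "Mentors,releases\n"
    "Committers,\n"
    "PMC,   \n"
    "Podling,votes\n"
)
data = pd.read_csv(raw)
print(drop_empty_components(data, "Object")["Object"].tolist())
# prints ['releases', 'votes']
```

Treating whitespace-only cells as empty is a choice made in this sketch. Drop that step if blank strings are meaningful in your data.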

In [None]:
Component = "Object"
top_n_words = 3
num_topic = 20

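# clustering of components
from bertopic import BERTopic
from google.colab import files  # needed for files.download below (Colab only)
import pandas as pd

# Load only the chosen component column and drop rows with missing values
result = pd.read_csv('main.csv', usecols=[Component])
result.dropna(subset=[Component], inplace=True)
components = result[Component].tolist()

# Use the top_n_words setting from above rather than a hard-coded value
topic_model = BERTopic(top_n_words=top_n_words, nr_topics=num_topic)
topic_model.hdbscan_model.gen_min_span_tree = True
topic_model.umap_model.random_state = 0  # set seed to enable reproduction of clustering

topic_model.fit(components)
freq = topic_model.get_topic_info()

# Look up each topic's name by its topic id instead of by row position,
# so the mapping stays correct even when there is no outlier topic (-1)
topic_names = dict(zip(freq['Topic'], freq['Name']))
result[Component + '_group'] = topic_model.transform(components)[0]
result[Component + '_group'] = result[Component + '_group'].map(topic_names)
result.to_csv('main.csv', index=False)
files.download('main.csv')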

   Topic  Count                                    Name  \
0     -1     32              -1_what_they_found_checked   
1      0     16       0_champion_mentors_incubation_and   
2      1     12  1_disclaimer_disclaimers_progress_work   
3      2     15             2_asf_non_policies_releases   
4      3     18             3_apache_the_replace_effort   
5      4     12     4_releases_cut_available_conditions   
6      5     21            5_podling_podlings_on_voting   
7      6     52                 6_the_to_incubator_that   
8      7     23       7_community_diverse_to_committers   
9      8     11              8_vote_votes_line_approach   

                                    Representation  \
0                 [what, they, found, checked, to]   
1     [champion, mentors, incubation, and, mentor]   
2  [disclaimer, disclaimers, progress, work, text]   
3     [asf, non, policies, releases, successfully]   
4             [apache, the, replace, effort, name]   
5    [releases, cut, avail

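
If the clustering returns more groups than you need, similar groups can be combined by hand in the output file. The following is a minimal sketch of such a relabeling with pandas. The helper name `merge_groups` is hypothetical, the "Object" values are made up, and the choice of which two groups to merge is purely illustrative. BERTopic also provides its own topic-merging functionality, which this sketch does not use.

```python
import pandas as pd


def merge_groups(df, group_column, merge_map):
    """Return a copy of `df` with selected group labels replaced.

    `merge_map` maps an existing label to the label it should be folded
    into; labels that are not listed are left unchanged.
    """
    merged = df.copy()
    merged[group_column] = merged[group_column].replace(merge_map)
    return merged


# Group labels follow the naming pattern of the sample output above;
# the "Object" values are invented for this example.
result = pd.DataFrame({
    "Object": ["the vote", "release votes", "a podling"],
    "Object_group": [
        "8_vote_votes_line_approach",
        "5_podling_podlings_on_voting",
        "5_podling_podlings_on_voting",
    ],
})
merged = merge_groups(
    result,
    "Object_group",
    {"8_vote_votes_line_approach": "5_podling_podlings_on_voting"},
)
print(merged["Object_group"].value_counts().to_dict())
# prints {'5_podling_podlings_on_voting': 3}
```

The function returns a copy, so the original group assignments stay available for comparison.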

### Going deeper
Check out the [BERTopic](https://maartengr.github.io/BERTopic/) documentation for more arguments, parameters and methods for more sophisticated topic modeling. For fine-tuning your topic modeling, see our [work](https://arxiv.org/abs/2309.14245) on governance of open source software.