# **Welcome to the Policy Explore Notebook**

* When examining emails, interviews, discussion threads, or other narrative records, it is often useful to know which of these exchanges within institutions concern governing policies, and how **closely**.

* This kind of analysis can address fundamental questions in policy research and provide insight into practical policy implementation and adoption.

* This notebook uses advances in natural language processing to compare two sets of texts.

  * The first data set contains the policies overseeing the community/organization. We refer to them as the "queries".
  * The second is a larger "haystack" of institutional discourse, generally day-to-day operations and exchanges. We refer to it as the "search base".
  * The rules/policies serve as the "query" to the search base.
  * The notebook returns the list of exchanges about the formal policy statement, in decreasing order of similarity score.

* In the sample use case we provide, the first file is a list of operational rules from the Apache Software Foundation, a group of software peer production communities, and the second file comprises all emails from these communities.

* We encourage users to extend the application for more sophisticated inferences, such as how "rules-in-form" are invoked in normative community behavior.

### **This notebook performs the following tasks**


*   It takes
  * a "query": a single institutional statement
  * a "searchbase": a potentially large corpus of formal or informal exchanges/communication records/ other documents.
* It queries the corpus with the institutional statement
* It returns a dataframe with columns for
  * the statements in the haystack (search base) that are most similar to the needle (query)
  * each statement's numerical similarity to the needle.
*   By default, it performs these tasks on a dataset we provide, but the intent is that users will upload their own pair of datasets.
*   It outputs a downloadable table of statements from the haystack and their similarities, starting with the most similar statements.








# **Installations and Setup**
* This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.
* The commands below install the components needed for the rest of the analysis. To run them, press ***ctrl+enter*** or select ***Runtime*** from the menu above and then one of the ***Run*** options within it.

In [None]:
!git clone https://github.com/BSAkash/NLP4GOV
%cd NLP4GOV/src
!pip install -q -r ./policy_explore/requirements.txt
from policy_explore import semantic_search
%cd /content/

# **Data upload**

Run the cell below to upload your own data and run your own search engine.

Your own data will likely need to be adapted before this notebook can run on it. The required format is as follows.

Please name the uploaded files `main1.csv` and `main2.csv`. In `main1.csv`, name the policy column `document`. In `main2.csv`, name the column that holds the search base text `corpus`.
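
The expected layout of the two files can be sketched as below. This is only an illustration using Python's standard `csv` module: the row contents are made up for the example, and the helper names `column_names` and `has_column` are hypothetical, not part of the notebook's code.

```python
import csv
import io

# Hypothetical file contents for illustration; real files hold your own texts.
main1_text = 'document\n"each podling in incubation must report to the incubator pmc"\n'
main2_text = 'corpus\n"i submitted the report just now"\n"i prefer jira"\n'

def column_names(csv_text):
    """Return the header row of a CSV supplied as a string."""
    return next(csv.reader(io.StringIO(csv_text)))

def has_column(csv_text, name):
    """Check that a required column name appears in the header."""
    return name in column_names(csv_text)

policy_ok = has_column(main1_text, "document")  # required in main1.csv
corpus_ok = has_column(main2_text, "corpus")    # required in main2.csv
print(policy_ok, corpus_ok)
```

Extra columns are harmless for this check; only the two required column names matter.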


In [2]:
import os
from google.colab import files
uploaded = files.upload()

# **Or use our data archives**

You may also uncomment and run the cell below to follow the demonstration on archival data. This will download our datasets for the Apache Software Foundation, an open source software community, directly into Colab.

* [Queries: Apache community policies](https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv)

* [Search Base: Apache community emails]( https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv)

In [8]:
## Alternatively, use the !wget commands below to download our datasets into the notebook.
## Uncomment the commands (ctrl+/) before running this cell.

## query policies
# !wget -O main1.csv https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
# # search base (just read 'last_reply')
# !wget -O main2.csv https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv

--2024-03-21 05:38:03--  https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.101.207, 2607:f8b0:4023:c0d::cf
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47082 (46K) [text/csv]
Saving to: ‘main1.csv’


2024-03-21 05:38:03 (133 MB/s) - ‘main1.csv’ saved [47082/47082]

--2024-03-21 05:38:04--  https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.101.207, 2607:f8b0:4023:c0d::cf
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 356478777 (340M) [text/csv]
Saving to: ‘main2.csv’


2024-03-21 05:38:05 (187 MB/s) - ‘main2.csv’ saved [356478777/356478777]



# **Dataset Cleaning**

* Before uploading your own dataset, remove all rows whose text value is NaN (not a number, i.e. missing) from the CSV file.
* Make sure to reset the indexing in the CSV file (a sample is shown below).
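
The two cleaning steps above can be sketched on a toy table. The rows here are invented for illustration; `dropna` and `reset_index` are the same pandas calls used in the cell below.

```python
import pandas as pd

# Toy search base; the None entry stands in for an empty text cell.
toy = pd.DataFrame(
    {"corpus": ["i prefer jira.", None, "i submitted the report just now."]}
)

cleaned = toy.dropna(subset=["corpus"])    # drop rows whose text is missing
cleaned = cleaned.reset_index(drop=True)   # renumber rows 0, 1, 2, ... without gaps
print(cleaned)
```

Resetting the index matters because later steps refer to rows by position; gaps left by dropped rows would otherwise make those positions misleading.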

In [9]:
#read files
query = semantic_search.pd.read_csv("main1.csv")
search_base = semantic_search.pd.read_csv("main2.csv", usecols=['project_name', 'month', 'date', 'message_id', 'reply'])

# rename() ignores missing columns by default, so this is a no-op if there is no "reply" column
search_base = search_base.rename(columns={"reply": "corpus"})
print(query.columns)
print(search_base.columns)
search_base = search_base.dropna()
search_base = search_base.reset_index(drop=True)
search_base

Index(['policy.doc.name', 'section.name', 'document'], dtype='object')
Index(['project_name', 'month', 'date', 'message_id', 'corpus'], dtype='object')


Unnamed: 0,project_name,month,date,message_id,corpus
0,jena,16.0,2012-03-25 14:22:54,<4F6F1C2E.7000606@googlemail.com>,robert vesse wrote : i'd like to see faster re...
1,impala,6.0,2016-06-29 05:19:40,<201606290519.u5T5JeWS014957@ip-10-146-233-104...,bharath vissapragada has posted comments on th...
2,oozie,12.0,2012-07-16 18:30:36,<317574600.59506.1342463436256.JavaMail.jirato...,mona chitnis updated oozie - 907 :
3,hama,6.0,2008-11-12 20:57:55,<eb4706e0811120357x312a6d66ga5a18bedbeeb30a3@m...,"oh, problem is a block size. blockeddensematri..."
4,carbondata,4.0,2016-10-21 14:07:06,<20161021140706.22E7ADFDC4@git1-us-west.apache...,github user zhangshunyu closed the pull reques...
...,...,...,...,...,...
703259,iotdb,1.0,2018-12-20 16:26:51,<42cf87a3.2c890.167cab978c1.Coremail.x-y16@mai...,i prefer jira.
703260,cloudstack,6.0,2012-10-03 07:55:46,<9DDC9CE7-F7A6-421F-9A50-A887EF814B1D@basho.com>,has their been any consideration given to shif...
703261,netbeans,3.0,2017-01-01 16:08:28,<20170101160828.73540.19791@johns-mbp-2.home>,this email was sent by an automated system on ...
703262,qpid,4.0,2006-12-13 11:12:21,<6492006.1166037141581.JavaMail.jira@brutus>,gordon sim commented on qpid - 65 :


# **Selecting a query**

* **NOTE** : The notebook currently supports searching with a single policy at a time. We demonstrate it by using the first policy in the query set to search for related emails. To query with another policy, change the index number from 0 to that policy's index number.

  query_text = query["document"][** CSV index number of policy query**]

  The notebook also works if you directly type in a policy:

  query_text = "each podling in incubation must report to the incubator pmc"

In [10]:
query_text = query["document"][0] # this is the first policy from the query list
print("Query policy: ", query_text)

Query policy:  each podling in incubation must report to the incubator pmc.


# **Word based filtering (BM25Okapi)**
The code below performs a quick, preliminary, word based filtering of the search base. We discard all entries in the search base that share no words with the query (ignoring stopwords).
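
The idea behind this filtering step can be sketched in plain Python. The block below is a minimal illustration of Okapi BM25 scoring, not the implementation behind `lex_search`: the stopword list, the tokenizer, the parameter values `k1=1.5` and `b=0.75`, the IDF variant, and the toy documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the notebook's list).
STOPWORDS = {"the", "to", "in", "a", "of", "and", "is", "must", "each"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # A common non-negative IDF variant.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

query = "each podling in incubation must report to the incubator pmc"
docs = [
    "the incubator pmc would appreciate your podling report",
    "i prefer jira",
    "github user closed the pull request",
]
scores = bm25_scores(query, docs)
# Keep only documents that share at least one non-stopword with the query.
kept = [(doc, s) for doc, s in zip(docs, scores) if s > 0]
print(kept)
```

A document with no word in common with the query scores exactly zero, which is what makes the score usable as a cheap filter before the slower embedding step.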

In [11]:
obj = semantic_search.semantic_search()
filtered_data = obj.lex_search(query_text, search_base)# lex_search takes approx 16mins to run and produce filtered_data
filtered_data

100%|██████████| 703264/703264 [16:38<00:00, 704.03it/s]


Unnamed: 0,corpus_id,score
16386,the incubator pmc would appreciated if you cou...,21.750234
1152,the incubator pmc would appreciated if you cou...,21.750234
54443,the incubator pmc would appreciated if you cou...,21.750234
18194,the incubator pmc would appreciated if you cou...,21.750234
16681,the incubator pmc would appreciated if you cou...,21.750234
...,...,...
47982,this is the vote for apache hawq ( incubating ...,0.669324
20495,update : started with centos 6. 7 ( final ) bu...,0.669324
26627,i've staged a release candidate for aries 0. 2...,0.666647
22489,@ tobegit3hub and here is the entire cmake log...,0.651024


# **Semantic Sentence Embeddings**

* This code generates numerical representations of each sentence in the search base and the query itself.

* Each sentence is encoded into a vector of 768 numbers, which are coordinates embedding the text in a 768-dimensional semantic space. These coordinates are assigned so that sentences with similar meanings lie close to each other and sentences with dissimilar meanings lie far apart.

* Because distances in this space reflect semantic similarity, the coordinates can be used for clustering and categorizing text, ranking by mutual similarity, text based search, and information retrieval.

* Please note that the runtime of this code depends on the size of the search base.
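
To make "close to each other" concrete, here is a toy sketch with hand-written 3-dimensional vectors. These vectors are invented for illustration and are not model output; real embeddings in this notebook have 768 dimensions. The helper names `euclidean` and `nearest` are hypothetical.

```python
import math

# Made-up 3-dimensional "embeddings"; a real model would produce 768 numbers each.
toy_embeddings = {
    "podlings must report to the incubator pmc": [0.9, 0.1, 0.0],
    "i submitted the podling report just now": [0.8, 0.2, 0.1],
    "openjpa 0.9.7 release announcement": [0.0, 0.1, 0.9],
}

def euclidean(u, v):
    """Straight-line distance between two points in the embedding space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(sentence, embeddings):
    """Return the other sentence whose vector lies closest to the given one."""
    others = {s: vec for s, vec in embeddings.items() if s != sentence}
    return min(others, key=lambda s: euclidean(embeddings[sentence], others[s]))

policy = "podlings must report to the incubator pmc"
closest = nearest(policy, toy_embeddings)
print(closest)
```

In this toy space the report-related sentence sits next to the policy, while the release announcement is far away; the same nearest-neighbour logic underlies search and clustering on real embeddings.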


In [12]:
data = filtered_data['corpus_id']
data_embed = obj.sentence_embeddings_encode(semantic_search.word_embedding_model, data)# sentence encodings takes approx 15mins to be available

# **Evaluate Cosine similarity between query and search base texts**
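
Cosine similarity measures the angle between two vectors and ranges from -1 (opposite) to 1 (same direction). A plain dot product gives the same value when both vectors have unit length. The sketch below shows this relationship on made-up vectors; it makes no claim about how `dot_score` or the embedding model are implemented, and all names in it are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (norm(u) * norm(v))

def normalize(u):
    """Scale a vector to unit length."""
    length = norm(u)
    return [a / length for a in u]

# Made-up vectors standing in for a query embedding and corpus embeddings.
query_vec = [0.9, 0.1, 0.0]
corpus_vecs = {
    "i submitted the report just now": [0.8, 0.2, 0.1],
    "openjpa release announcement": [0.0, 0.1, 0.9],
    "an opposite direction example": [-0.9, -0.1, 0.0],
}

# Rank corpus texts from most to least similar to the query.
ranked = sorted(
    ((cosine_similarity(query_vec, vec), text) for text, vec in corpus_vecs.items()),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:+.3f}  {text}")
```

If embeddings are normalized once up front, every later comparison reduces to a dot product, which is cheaper on large search bases.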

In [13]:
cos_similarities = obj.dot_score(query_text,data_embed)
sorted_df = semantic_search.pd.DataFrame({'Data': data, 'Cosine Similarity': cos_similarities}).sort_values(by='Cosine Similarity', ascending=False)
sorted_df['Cosine Similarity'] = sorted_df['Cosine Similarity'].apply(lambda x: x[0].item())
sorted_df

100%|██████████| 55620/55620 [00:05<00:00, 10501.15it/s]


Unnamed: 0,Data,Cosine Similarity
5601,"i submitted the report just now. besides, i ad...",0.722142
52037,github user apetresc opened a pull request : s...,0.691571
18451,github user zhangh43 opened a pull request : y...,0.688802
38285,"le 4 / 30 / 12 2 : 57 pm, francesco chicchiric...",0.688802
51504,i've also included in the to podlings i've mad...,0.688802
...,...,...
14278,github user darionyaphet closed the pull reque...,-0.167637
26815,the apache wookie ( incubating ) team is pleas...,-0.175459
37043,i'm happy to announce that openjpa 0. 9. 7 pas...,-0.177239
38704,the click community would like to push out our...,-0.179297


# **Convert results to `.csv` and download**
* Run the cells below to write the results, along with their similarity scores, to a file and retrieve that file.
* Download is in `.csv` format and can be opened in any spreadsheet software.
* **Note** : This notebook hardcodes 0.5 as the similarity score above which matches are saved. You may edit the cell below to store results above or below a certain value.
  
  sorted_df[sorted_df['Cosine Similarity'] > **insert cutoff desired**].to_csv(csv_file_path, index=False)
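
The effect of the cutoff can be tried on a toy results table before running the export. The rows and scores below are invented, and `save_above` is a hypothetical helper; it writes to an in-memory buffer so nothing is saved to disk.

```python
import io
import pandas as pd

# Toy results table with the same column names as sorted_df.
toy_results = pd.DataFrame(
    {
        "Data": ["i submitted the report", "podling report reminder", "release announcement"],
        "Cosine Similarity": [0.72, 0.69, -0.17],
    }
)

def save_above(df, cutoff, target):
    """Keep rows scoring above the cutoff and write them as CSV to target."""
    kept = df[df["Cosine Similarity"] > cutoff]
    kept.to_csv(target, index=False)
    return kept

buffer = io.StringIO()           # stands in for a file path
kept = save_above(toy_results, 0.5, buffer)
print(buffer.getvalue())
```

Raising the cutoff returns fewer, closer matches; lowering it returns more rows at the cost of including loosely related text.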

In [14]:
csv_file_path = "sorted_cosine_similarities.csv"
sorted_df[sorted_df['Cosine Similarity'] > 0.5].to_csv(csv_file_path, index=False)

In [15]:
files.download(csv_file_path)


### Going deeper
People with programming experience can modify this notebook, and learn about the method in the process. More details about semantic similarity, semantic search, BM25Okapi, transformer-based word-embeddings, and other parts of this notebook are in the [README file](https://github.com/BSAkash/NLP4GOV/blob/master/policy_explore/README.md).