# **Welcome to the Policy Explore Notebook**

* When examining emails, interviews, discussion threads, or other narrative records, it is often useful to know which of these exchanges within an institution concern its governing policies, and how **closely**.

* This kind of analysis can address fundamental questions in policy research and provide insight into practical policy implementation and adoption.

* This notebook uses advances in natural language processing to compare two sets of texts.

  * The first data set is the set of policies governing the community/organization. We refer to them as the "queries".
  * The second is a larger "haystack" of institutional discourse, generally day-to-day operations and exchanges. We refer to it as the "searchbase".
  * The rules/policies serve as the "query" to the searchbase.
  * The notebook returns the list of exchanges related to the formal policy statement, in descending order of similarity score.

* In the sample use case we provide, the first file is a list of operational rules from the Apache Software Foundation, a group of software peer production communities, and the second file comprises all emails from these communities.

* We encourage users to extend the application for more sophisticated inferences, such as how "rules-in-form" are invoked in normative community behavior.

### **This notebook performs the following tasks**


*   It takes
  * a "query": a single institutional statement
  * a "searchbase": a potentially large corpus of formal or informal exchanges/communication records/ other documents.
* It queries the corpus with the institutional statement
* It returns a dataframe with columns for
  * the statements in the searchbase (the "haystack") that are most similar to the query (the "needle")
  * each statement's numerical similarity to the needle.
*   By default, it performs these tasks on a dataset we provide, but the intent is that users will upload their own pair of datasets.
*   It outputs a downloadable table of statements from the haystack and their similarities, starting with the most similar statements.

# **Installations and Setup**
* This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.
* The commands below install the components that the rest of the analysis needs. To run the cell, press ***ctrl+enter*** or select ***Runtime*** from the menu above and then one of the ***Run*** options within it.

In [1]:
!git clone https://github.com/BSAkash/NLP4GOV
%cd NLP4GOV/src
!pip install -q -r ./policy_explore/requirements.txt
from policy_explore import semantic_search
%cd /content/

Cloning into 'NLP4GOV'...
remote: Enumerating objects: 2367, done.
remote: Counting objects: 100% (523/523), done.
remote: Compressing objects: 100% (264/264), done.
remote: Total 2367 (delta 306), reused 471 (delta 256), pack-reused 1844
Receiving objects: 100% (2367/2367), 9.71 MiB | 7.85 MiB/s, done.
Resolving deltas: 100% (1405/1405), done.
/content/NLP4GOV/src
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 171.5/171.5 kB 3.4 MB/s eta 0:00:00

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/545 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/319 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

/content


# **Data upload**

Run the cells below to search your own data.

You will likely have to adapt your own data before this notebook can run on it. See below for the sample format.

Please name uploaded files as `main1.csv` and `main2.csv`. <br/>
 **In 'main1.csv', name the policy/search item column as "document". In 'main2.csv', the column corresponding to the text search base should be named "corpus". <br/>
 Data field names are case sensitive. E.g. 'Corpus' instead of 'corpus' will lead to unexpected errors.**
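
As a rough illustration of the naming rule above, the following sketch checks whether a CSV header contains the expected column name using exact, case-sensitive matching. The helper `has_column` and the miniature CSV strings are hypothetical and not part of the notebook; the sketch relies only on Python's standard `csv` module.

```python
import csv
import io

def has_column(csv_text, required):
    """Return True if the CSV header contains `required` exactly (case-sensitive)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return required in header

# Hypothetical miniature files standing in for main1.csv and main2.csv.
queries_csv = "document\neach podling in incubation must report to the incubator pmc\n"
good_base_csv = "corpus\ncould you update the report on this page\n"
bad_base_csv = "Corpus\ncould you update the report on this page\n"

print("main1 has 'document':", has_column(queries_csv, "document"))
print("main2 has 'corpus':", has_column(good_base_csv, "corpus"))
print("capitalized header accepted:", has_column(bad_base_csv, "corpus"))
```

Running a check like this on your own machine before uploading may catch a misnamed column earlier than the notebook would.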

Use the following cells to upload the queries and the search base. <br/>

In [None]:
#upload queries/main1.csv
import os
from google.colab import files
uploaded = files.upload()
os.rename(list(uploaded.keys())[0], 'main1.csv')

In [None]:
#upload searchbase/main2.csv
uploaded = files.upload()
os.rename(list(uploaded.keys())[0], 'main2.csv')

# **Or use our data archives**

You may also uncomment and run the cell below to follow the demonstration on archival data. This downloads our datasets for the Apache Software Foundation, an open source software community, directly into Colab.

* [Queries: Apache community policies](https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv)

* [Search Base: Apache community emails](https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv)

In [2]:
# #Else you can directly use the !wget command below to download our datasets into the code notebook,
# #make sure you uncomment the below code by (ctrl+/) keys before running this cell of code

# # query policies
# !wget -O main1.csv https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
# # search base (just read 'last_reply')
# !wget -O main2.csv https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv

--2024-05-08 05:11:42--  https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.8.207, 142.251.170.207, 173.194.174.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.8.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47082 (46K) [text/csv]
Saving to: ‘main1.csv’


2024-05-08 05:11:43 (1.71 MB/s) - ‘main1.csv’ saved [47082/47082]

--2024-05-08 05:11:43--  https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.8.207, 142.251.170.207, 173.194.174.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.8.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 356478777 (340M) [text/csv]
Saving to: ‘main2.csv’


2024-05-08 05:11:56 (26.5 MB/s) - ‘main2.csv’ saved [356478777/356478777]



# **Dataset Cleaning**

* Before uploading your own dataset, remove all rows whose text value is NaN (not a number, which is how pandas marks missing values) from the csv file.
* Make sure to reset the index in the csv file (a sample is shown below).
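
One possible way to do this cleaning on your own machine before uploading is sketched below, using only Python's standard library. The helper name `drop_empty_rows`, the sample rows, and the choice to treat the literal string "nan" as a missing value are assumptions made for illustration.

```python
import csv
import io

def drop_empty_rows(csv_text, text_column):
    """Return CSV text without the rows whose `text_column` is empty or 'nan'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [
        row for row in reader
        if (row.get(text_column) or "").strip().lower() not in ("", "nan")
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()

# Hypothetical sample: rows 2 and 3 have no usable text.
raw = "message_id,corpus\n1,first message\n2,\n3,nan\n4,second message\n"
cleaned = drop_empty_rows(raw, "corpus")
print(cleaned)
```

Because no index column is written, pandas assigns fresh consecutive row numbers when it reads the cleaned file.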

In [3]:
#read files
query = semantic_search.pd.read_csv("main1.csv")

# read the search base

try:

  # prepare the demo Apache dataset and sample a random subset
  search_base = semantic_search.pd.read_csv("main2.csv", usecols=['project_name', 'month', 'date', 'message_id', 'reply']).sample(10000,random_state=0)
  search_base.rename(columns={"reply": "corpus"}, inplace=True)

except Exception:
  # otherwise assume a user-uploaded file that already has a 'corpus' column
  search_base = semantic_search.pd.read_csv("main2.csv")

#drop empty entries
search_base = search_base.dropna(subset=['corpus'])
search_base = search_base.reset_index(drop=True)
search_base

Unnamed: 0,project_name,month,date,message_id,corpus
0,brooklyn,15.0,2015-08-13 03:51:34,<git-pr-819-incubator-brooklyn@git.apache.org>,github user hzbarcea opened a pull request : r...
1,abdera,2.0,2006-08-15 16:29:18,<44E258CE.7050604@gmail.com>,many of the methods in the entry and feed inte...
2,qpid,4.0,2006-12-05 15:50:09,<1ba9d4a00612050750l533d52d9rfd34653ccb915be1@...,well at home i'm on ubuntu edgy so i'll be try...
3,zeppelin,11.0,2015-11-19 00:43:11,<JIRA.12914309.1447893772000.115375.1447893791...,lee moon soo created zeppelin - 444 :
4,lens,7.0,2015-05-18 12:49:59,<JIRA.12830611.1431937294000.143285.1431953399...,rajat khandelwal commented on lens - 562 :
...,...,...,...,...,...
9692,synapse,14.0,2006-10-06 15:28:51,<1160141332.5324.28.camel@localhost.localdomain>,httpcomponents is still very much an evolving ...
9693,zeppelin,15.0,2016-03-08 08:25:02,<CAGU5spcwKJYeEWe7ofYqi1s6pecjiqm6gfheZMMGisRR...,looks like r support is something the communit...
9694,streams,23.0,2014-10-07 15:44:33,<JIRA.12746426.1412696636000.208583.1412696673...,steve blackmon created streams - 187 :
9695,spamassassin,2.0,2004-02-18 17:31:18,<1077150678.4735.62.camel@localhost>,"on this same line, i was also going to suggest..."


# **Selecting a query**

* **NOTE**: The notebook currently supports searching with a single policy at a time. We demonstrate it using the first policy in the query set to search for related emails. To query with another policy, change the index number from 0 to a different index number.

  `query_text = query["document"][<CSV index number of policy query>]`

  The notebook also works if you type in a policy directly:

  `query_text = "each podling in incubation must report to the incubator pmc"`

In [4]:
# this is the first policy from the query list (default)
# change the [number] to select a different query
query_text = query["document"][0]
print("Query policy: ", query_text)

Query policy:  each podling in incubation must report to the incubator pmc.


# **Semantic Sentence Embeddings**

* This code generates numerical representations of each sentence in the search base and the query itself.

* Each sentence is encoded into a vector of 768 numbers, which act as coordinates placing the text in a 768-dimensional semantic space. The encoding is learned so that sentences with similar meanings land close to each other and sentences with different meanings land far apart.

* Because distances in this space reflect semantic similarity, the coordinates can be used for clustering and categorizing text, ranking by mutual similarity, text-based search, and information retrieval.

* Please note that the runtime of this code depends on the size of the search base.
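
To make the idea of closeness concrete, here is a minimal sketch that uses made-up three-dimensional vectors in place of the real 768-dimensional embeddings. The vectors, the example texts, and the helper `cosine_similarity` are illustrative assumptions; the notebook's own encoder is not involved.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up coordinates: the first text points in nearly the same
# direction as the query, the second points elsewhere.
query_vec = [0.9, 0.1, 0.0]
corpus_vecs = {
    "podlings must file their reports": [0.8, 0.2, 0.1],
    "pull request opened": [0.0, 0.1, 0.9],
}

# Rank the texts from most to least similar to the query.
ranked = sorted(
    corpus_vecs,
    key=lambda text: cosine_similarity(query_vec, corpus_vecs[text]),
    reverse=True,
)
for text in ranked:
    print(round(cosine_similarity(query_vec, corpus_vecs[text]), 3), text)
```

The same ranking logic applies in 768 dimensions; only the vectors get longer.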


In [5]:
data = search_base['corpus']
obj = semantic_search.semantic_search()
data_embed = obj.sentence_embeddings_encode(semantic_search.word_embedding_model, data)  # encoding the sentences takes approx. 15 minutes

# **Evaluate Cosine similarity between query and search base texts**

In [6]:
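# score every search base text against the query
cos_similarities = obj.dot_score(query_text, data_embed)
# pair each text with its score and sort from most to least similar
sorted_df = semantic_search.pd.DataFrame({'Data': data, 'Cosine Similarity': cos_similarities}).sort_values(by='Cosine Similarity', ascending=False)
# unwrap each score (stored as a one-element container) into a plain number
sorted_df['Cosine Similarity'] = sorted_df['Cosine Similarity'].apply(lambda x: x[0].item())
sorted_df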

100%|██████████| 9697/9697 [00:00<00:00, 13043.88it/s]


Unnamed: 0,Data,Cosine Similarity
2894,my thoughts are : ( 1 ) incubating podlings ha...,0.575955
4030,"as you may be aware, the incubator has recentl...",0.529124
7715,this email was sent by an automated system on ...,0.522152
1329,could you update the report on this page [ 1 ]...,0.520821
6839,please find the current final draft below. i b...,0.510301
...,...,...
5556,github user tellison commented on a diff in th...,-0.173919
6704,github user tellison commented on a diff in th...,-0.173919
3649,tony faustini created iota - 18 :,-0.174050
9522,github user leemoonsoo commented on the pull r...,-0.175836
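
The cell above calls a method named `dot_score` while the results are labeled as cosine similarity. The two quantities agree when the vectors have unit length. Whether this notebook's model normalizes its embeddings is an assumption we have not verified here, so the sketch below only demonstrates the general relationship on made-up two-dimensional vectors; the helpers `dot`, `normalize`, and `cosine` are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                         # grows with vector length
unit_dot = dot(normalize(a), normalize(b))  # lengths removed first
cos = cosine(a, b)

print("raw dot product:", raw_dot)
print("dot product of unit vectors:", round(unit_dot, 6))
print("cosine similarity:", round(cos, 6))
```

If the embeddings were not normalized, dot-product scores could fall outside the range from -1 to 1 and would favor longer vectors.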


# **Convert results to `.csv` and download**
* Run the cells below to write the results, along with their similarity scores, to a file and download it.
* The download is in `.csv` format and can be opened in any spreadsheet software.
* **Note**: This notebook hardcodes 0.5 as the similarity score above which matches are saved. You may edit the cell below to store results above or below a different value.
  
  `sorted_df[sorted_df['Cosine Similarity'] > <insert desired cutoff>].to_csv(csv_file_path, index=False)`
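
The effect of the cutoff can be sketched on a handful of rows copied from the results table above, using plain Python lists rather than the notebook's dataframe. The helpers `above_cutoff` and `top_k` are hypothetical; `top_k` shows an alternative of keeping a fixed number of best matches, which the notebook itself does not do.

```python
# (text, score) pairs taken from the results table shown earlier.
results = [
    ("my thoughts are : ( 1 ) incubating podlings ha...", 0.575955),
    ("as you may be aware, the incubator has recentl...", 0.529124),
    ("this email was sent by an automated system on ...", 0.522152),
    ("could you update the report on this page [ 1 ]...", 0.520821),
    ("please find the current final draft below. i b...", 0.510301),
    ("tony faustini created iota - 18 :", -0.174050),
]

def above_cutoff(rows, cutoff):
    """Keep rows whose score is strictly greater than `cutoff`."""
    return [row for row in rows if row[1] > cutoff]

def top_k(rows, k):
    """Keep the `k` highest-scoring rows, best first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]

print(len(above_cutoff(results, 0.5)), "rows above 0.5")
print(len(above_cutoff(results, 0.52)), "rows above 0.52")
print([score for _, score in top_k(results, 2)])
```

A higher cutoff keeps fewer but closer matches, while a fixed `k` returns the same number of rows regardless of how similar they are.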

In [7]:
csv_file_path = "sorted_cosine_similarities.csv"
sorted_df[sorted_df['Cosine Similarity'] > 0.5].to_csv(csv_file_path, index=False)

In [None]:
files.download(csv_file_path)


### Going deeper
People with programming experience can modify this notebook and learn about the method in the process. More details about semantic similarity, semantic search, transformer-based word embeddings, and other parts of this notebook are in this [README file](https://github.com/BSAkash/NLP4GOV/blob/master/src/policy_explore/README.md).