# **Welcome to the Policy Explore Notebook**

* When examining emails, interviews, discussion threads, or other narrative records, it is often useful to know which of these exchanges within institutions concern governing policies, and how **closely**.

* This kind of analysis can address fundamental questions in policy research and provide insight into practical policy implementation and adoption.

* This notebook uses advances in natural language processing to compare two sets of texts.

  * The first data set is the set of policies governing the community or organization. We refer to them as the "queries".
  * The second is a larger "haystack" of institutional discourse, generally day-to-day operations and exchanges. We refer to it as the "searchbase".
  * Each rule or policy serves as a "query" against the searchbase.
  * The notebook returns the list of exchanges most related to the formal policy statement, in order of similarity score.

* In the sample use case we provide, the first file is a list of operational rules from the Apache Software Foundation, a group of software peer production communities, and the second file comprises all emails from these communities.

* We encourage users to extend the application for more sophisticated inferences, such as how "rules-in-form" are invoked in normative community behavior.

### **This notebook performs the following tasks**


* It takes
  * a "query": a single institutional statement (the "needle")
  * a "searchbase": a potentially large corpus of formal or informal exchanges, communication records, or other documents (the "haystack").
* It queries the corpus with the institutional statement.
* It returns a dataframe with columns for
  * the statements in the haystack that are most similar to the needle
  * each statement's numerical similarity to the needle.
* By default, it performs these tasks on a dataset we provide, but the intent is that users will upload their own pair of datasets.
* It outputs a downloadable table of statements from the haystack and their similarities, starting with the most similar statements.
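
The tasks listed above can be illustrated with a toy sketch. It is not the notebook's method: it swaps the transformer embeddings for plain word counts so that it runs with the standard library alone, and the helper names (`bag_of_words`, `cosine`, `rank_haystack`) and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_haystack(needle, haystack):
    """Return (statement, score) pairs, most similar first."""
    needle_vec = bag_of_words(needle)
    scored = [(text, cosine(needle_vec, bag_of_words(text))) for text in haystack]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, not taken from the Apache datasets.
needle = "each podling must report to the incubator"
haystack = [
    "lunch is at noon today",
    "the podling must send its report to the incubator this week",
    "please review my patch",
]

ranked = rank_haystack(needle, haystack)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Word counts only match exact words, which is why the notebook relies on sentence embeddings instead: they can also score paraphrases that share no vocabulary with the policy.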








# **Installations and Setup**
* This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.
* The commands below install the components that the rest of the analysis needs. To run the cell, press ***Ctrl+Enter*** or select ***Runtime*** from the menu above and then one of the ***Run*** options within it.

In [1]:
!git clone https://github.com/BSAkash/NLP4GOV
%cd NLP4GOV/src
!pip install -q -r ./policy_explore/requirements.txt
from policy_explore import semantic_search
%cd /content/

Cloning into 'NLP4GOV'...
remote: Enumerating objects: 2345, done.
remote: Counting objects: 100% (501/501), done.
remote: Compressing objects: 100% (245/245), done.
remote: Total 2345 (delta 291), reused 461 (delta 253), pack-reused 1844
Receiving objects: 100% (2345/2345), 9.70 MiB | 16.77 MiB/s, done.
Resolving deltas: 100% (1390/1390), done.
/content/NLP4GOV/src
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 171.5/171.5 kB 3.2 MB/s eta 0:00:00

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/545 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/319 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

/content


# **Data upload**

Run this cell to upload your own data and run the search on it.

Your own data will likely need to be adapted before this notebook can run on it. See below for the required format.

Please name uploaded files as `main1.csv` and `main2.csv`. <br/>
 **In 'main1.csv', name the policy column as "document". In 'main2.csv', the column corresponding to the text search base should be named "corpus".**
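
As a minimal sketch of the expected layout, the snippet below writes two small files with the required column names. The rows are hypothetical placeholders, the helper `write_single_column_csv` is an illustrative name, and the files go to a temporary folder so nothing is overwritten.

```python
import csv
import os
import tempfile

# Hypothetical example rows; replace them with your own policies and records.
policies = ["each podling in incubation must report to the incubator pmc"]
records = [
    "Reminder: podling reports are due next week.",
    "Has anyone reviewed the latest patch?",
]

# Written to a temporary folder here so nothing on disk is overwritten.
out_dir = tempfile.mkdtemp()
main1_path = os.path.join(out_dir, "main1.csv")
main2_path = os.path.join(out_dir, "main2.csv")


def write_single_column_csv(path, column_name, rows):
    """Write one text column under the header the notebook expects."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column_name])
        writer.writerows([row] for row in rows)


write_single_column_csv(main1_path, "document", policies)  # policy queries
write_single_column_csv(main2_path, "corpus", records)     # search base

with open(main2_path, newline="", encoding="utf-8") as handle:
    print(handle.readline().strip())  # header row of main2.csv
```

Extra columns can stay in either file; only the `document` and `corpus` headers are named as requirements above.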


In [None]:
import os
from google.colab import files
uploaded = files.upload()

# **Or use our data archives**

You may also uncomment and run the cell below to follow the demonstration on archival data. This downloads our datasets for the Apache Software Foundation, an open source software community, directly into Colab.

* [Queries: Apache community policies](https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv)

* [Search Base: Apache community emails](https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv)

In [None]:
##Else you can directly use the !wget command below to download our datasets into the code notebook,
##make sure you uncomment the below code by (ctrl+/) keys before running this cell of code

## query policies
# !wget -O main1.csv https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
# # search base (just read 'last_reply')
# !wget -O main2.csv https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv

--2024-03-21 05:38:03--  https://storage.googleapis.com/public_data_c2/IG_datasets/ASF_policies.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.101.207, 2607:f8b0:4023:c0d::cf
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47082 (46K) [text/csv]
Saving to: ‘main1.csv’


2024-03-21 05:38:03 (133 MB/s) - ‘main1.csv’ saved [47082/47082]

--2024-03-21 05:38:04--  https://storage.googleapis.com/public_data_c2/CHI_zenodo/emails_dev_214.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.101.207, 2607:f8b0:4023:c0d::cf
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 356478777 (340M) [text/csv]
Saving to: ‘main2.csv’


2024-03-21 05:38:05 (187 MB/s) - ‘main2.csv’ saved [356478777/356478777]



# **Dataset Cleaning**

* Before uploading your own dataset, remove every row whose text value is NaN ("not a number", that is, missing) from the CSV file.
* Make sure to reset the index in the CSV file (a sample is shown below).
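
A minimal sketch of this cleaning step, applied to a hypothetical four-row table rather than a real upload. Dropping whitespace-only rows is an extra assumption beyond the two steps listed above.

```python
import pandas as pd

# Hypothetical raw search base with one missing value and one blank row.
raw = pd.DataFrame({
    "corpus": [
        "Please file the report by Friday.",
        None,
        "   ",
        "The vote passed with three +1s.",
    ]
})

# Flag rows that are missing or contain only whitespace.
is_blank = raw["corpus"].fillna("").str.strip() == ""

# Drop the flagged rows and renumber the remaining ones from 0.
cleaned = raw[~is_blank].reset_index(drop=True)
print(cleaned)

# To save the cleaned table for upload (left commented out in this sketch):
# cleaned.to_csv("main2.csv", index=False)
```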

In [4]:
# read the query file
query = semantic_search.pd.read_csv("main1.csv")

# list the names of the columns you would like to keep in the search base
try:
  search_base = semantic_search.pd.read_csv("main2.csv", usecols=['project_name', 'month', 'date', 'message_id', 'reply'])
  search_base.rename(columns={"reply": "corpus"}, inplace=True)
except ValueError:
  # pandas raises ValueError when the columns listed above are not in the file,
  # so fall back to a file that already has a "corpus" column
  search_base = semantic_search.pd.read_csv("main2.csv", usecols=['corpus'])
print(query.columns)
print(search_base.columns)
search_base = search_base.dropna()
search_base = search_base.reset_index(drop=True)
search_base

Index(['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'document',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17',
       'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21',
       'Unnamed: 22', 'Unnamed: 23'],
      dtype='object')
Index(['corpus'], dtype='object')


Unnamed: 0,corpus
0,Unspoken Rule #1.
1,Be very careful about revealing personal infor...
2,Multiple users have been shadowbanned due to p...
3,Don't do that.
4,Unspoken Rule #2.
5,This means you should try not to be a dick.
6,"If you continually act like a dick, you will b..."
7,"Mind you, most of us are Marines, so there may..."
8,Rule #1.
9,Be civil with others but also don't forget to ...


# **Selecting a query**

* **NOTE**: The notebook currently supports searching with a single policy at a time. We demonstrate it by using the first policy in the query set to search for related emails. To query with another policy, change the index number from 0 to that policy's index number.

  query_text = query["document"][** CSV index number of policy query**]

  The notebook also works if you directly type in a policy:

  query_text = "each podling in incubation must report to the incubator pmc"
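
One way to pick an index is to print every policy next to its row position first. The sketch below uses a small hypothetical stand-in for the `query` table (the second policy is invented for illustration) and selects by position with `.iloc`.

```python
import pandas as pd

# Hypothetical stand-in for the `query` table loaded from main1.csv.
query = pd.DataFrame({
    "document": [
        "each podling in incubation must report to the incubator pmc",
        "releases must be approved by a vote",
    ]
})

# Show each policy next to its row position so an index can be chosen.
for position, policy in enumerate(query["document"]):
    print(position, policy)

policy_index = 1  # change this number to search with a different policy
query_text = query["document"].iloc[policy_index]
print("Query policy:", query_text)
```

Selecting with `.iloc` counts rows from the top of the table, so it behaves the same whether or not the index has been reset.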

In [11]:
# this is the first policy from the query list (default)
# change the [number] to select a different query
query_text = query["document"][0]
print("Query policy: ", query_text)

Query policy:  Don't talk about hacking.


# **Semantic Sentence Embeddings**

* This code generates numerical representations of each sentence in the search base and of the query itself.

* Each sentence is encoded into a vector of 768 numbers, which are coordinates that place the text in a 768-dimensional semantic space. The coordinates are assigned so that sentences with similar meanings lie close together and sentences with dissimilar meanings lie far apart.

* Because distances in this space reflect semantic similarity, the coordinates can be used for clustering and categorizing text, ranking by mutual similarity, text-based search, and information retrieval.

* Please note that the runtime of this code depends on the size of the search base.
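
The idea that similar sentences sit close together can be shown with hand-made coordinates. The three-dimensional vectors below are purely illustrative assumptions; real embeddings come from the model and have many more dimensions. The helper `nearest` is a hypothetical name.

```python
import math

# Hand-made 3-dimensional coordinates, for illustration only.
toy_embeddings = {
    "podlings must report monthly": (0.9, 0.1, 0.0),
    "each podling files a monthly report": (0.8, 0.2, 0.1),
    "the build server is down": (0.0, 0.1, 0.9),
}


def nearest(sentence, embeddings):
    """Return the other sentence whose coordinates are closest (Euclidean)."""
    anchor = embeddings[sentence]
    others = {s: v for s, v in embeddings.items() if s != sentence}
    return min(others, key=lambda s: math.dist(anchor, others[s]))


closest = nearest("podlings must report monthly", toy_embeddings)
print(closest)
```

The two sentences about podling reports were given nearby coordinates, so one is returned as the nearest neighbour of the other, while the unrelated sentence sits far away.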


In [12]:
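data = search_base['corpus']
# `obj` is assumed to be the search helper object created during setup; it is not defined in the cells shown here
# encoding the sentences can take approximately 15 minutes
data_embed = obj.sentence_embeddings_encode(semantic_search.word_embedding_model, data)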

# **Evaluate Cosine similarity between query and search base texts**
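
As background, the cell below calls a method named `dot_score` and labels the result "Cosine Similarity". A dot product equals the cosine similarity only when both vectors have unit length; whether the notebook's embeddings are normalized is not confirmed here. The sketch uses toy two-dimensional vectors and a hypothetical helper `cosine_similarity` to show the relationship.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors standing in for a query embedding and a document embedding.
query_vec = np.array([3.0, 4.0])
doc_vec = np.array([4.0, 3.0])

raw_dot = float(np.dot(query_vec, doc_vec))   # grows with vector length
cos = cosine_similarity(query_vec, doc_vec)   # depends only on direction

# After scaling each vector to unit length, the two measures coincide.
unit_query = query_vec / np.linalg.norm(query_vec)
unit_doc = doc_vec / np.linalg.norm(doc_vec)
unit_dot = float(np.dot(unit_query, unit_doc))

print(raw_dot, cos, unit_dot)
```

Cosine similarity ranges from -1 to 1, which is what makes a fixed cutoff on the score meaningful across documents of different lengths.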

In [13]:
cos_similarities = obj.dot_score(query_text,data_embed)
sorted_df = semantic_search.pd.DataFrame({'Data': data, 'Cosine Similarity': cos_similarities}).sort_values(by='Cosine Similarity', ascending=False)
sorted_df['Cosine Similarity'] = sorted_df['Cosine Similarity'].apply(lambda x: x[0].item())
sorted_df

100%|██████████| 36/36 [00:00<00:00, 7540.32it/s]


Unnamed: 0,Data,Cosine Similarity
3,Don't do that.,0.483757
14,Do not try to debate the mod on your interpret...,0.358261
34,We will not feed that blackmail/sexual harassm...,0.336723
7,"Mind you, most of us are Marines, so there may...",0.333354
5,This means you should try not to be a dick.,0.289818
16,"Do not post any content which violates OPSEC, ...",0.283848
21,We can make exceptions for non-profit organiza...,0.25925
1,Be very careful about revealing personal infor...,0.244336
9,Be civil with others but also don't forget to ...,0.244332
2,Multiple users have been shadowbanned due to p...,0.231217


# **Convert results to `.csv` and download**
* Run the cells below to write the results, along with their similarity scores, to a file and download it.
* The download is in `.csv` format and can be opened in any spreadsheet software.
* **Note**: This notebook hardcodes 0.5 as the similarity score above which matches are saved. You may edit the cell below to store results above or below a different value.
  
  sorted_df[sorted_df['Cosine Similarity'] > **insert cutoff desired**].to_csv(csv_file_path, index=False)
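
Choosing the cutoff matters. In the sample output shown earlier, the highest score is 0.483757, so the default 0.5 cutoff would save an empty table for that run. The sketch below rebuilds three rows of that output and compares a score cutoff with a fixed number of top results; `select_matches` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Small stand-in for `sorted_df`, using three rows from the sample output.
sorted_df = pd.DataFrame({
    "Data": [
        "Don't do that.",
        "Do not try to debate the mod on your interpret...",
        "Do not post any content which violates OPSEC, ...",
    ],
    "Cosine Similarity": [0.483757, 0.358261, 0.283848],
})


def select_matches(df, cutoff=None, top_k=None):
    """Keep rows scoring above `cutoff`, then at most the `top_k` best."""
    result = df.sort_values(by="Cosine Similarity", ascending=False)
    if cutoff is not None:
        result = result[result["Cosine Similarity"] > cutoff]
    if top_k is not None:
        result = result.head(top_k)
    return result


print(len(select_matches(sorted_df, cutoff=0.5)))   # rows above 0.5
print(len(select_matches(sorted_df, cutoff=0.35)))  # rows above 0.35
print(select_matches(sorted_df, top_k=1))           # single best match
```

A top-k selection always returns something, while a cutoff guarantees a minimum score; which one fits depends on whether empty results are acceptable.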

In [None]:
csv_file_path = "sorted_cosine_similarities.csv"
sorted_df[sorted_df['Cosine Similarity'] > 0.5].to_csv(csv_file_path, index=False)

In [None]:
files.download(csv_file_path)


### Going deeper
People with programming experience can modify this notebook, and learn about the method in the process. More details about semantic similarity, semantic search, BM25Okapi, transformer-based word-embeddings, and other parts of this notebook are in this [README file](https://github.com/BSAkash/NLP4GOV/blob/master/src/policy_explore/README.md).