# **Welcome to Semantic Search Google Colab Notebook**

[SETH SUGGESTS]
# **Welcome to Semantic Search Code Notebook**
When examining meeting minutes, interviews, court opinions, or other narrative records, it is often useful to know, for a given statement, what formal rule it is "about." With an automated mapping of informal statements to formal rules, it becomes possible to directly compare an institution's "rules-in-use" to its "rules-in-form."

This kind of analysis can address fundamental issues around policy implementation, policy internalization, and emergent norms.

This notebook uses advances in natural language processing to take two sets of texts and search one with the other. The first is a policy statement that you use to "query" the larger corpus: a "needle" to search for in the haystack. The second is a large "haystack" of institutional discourse. What the search gives back are the statements in the haystack that are most like the query. Put another way, it returns the informal statements that are most likely to be "about" the formal policy statement (in terms of their numerical similarity to that statement).

In the sample use case we provide, the first file gives several example institutional statements from our work on commons-based peer production communities, and the second file is a large list of governance-related sentences from the community's public emails. The code searches the "haystack" of email sentences with the "needle" of one of the community's rules to find the statements that relate to it. From this step, more complex tasks can be performed to compare "rules-in-form" and "rules-in-use" in this community.


[SETH SUGGESTS REPLACING BELOW]
### This notebook performs the following tasks


*   It takes
  * a "needle": a single institutional statement
  * a "haystack": a potentially large corpus of formal or informal institutional statements from any domain
* It queries the corpus with the institutional statement
* It returns a dataframe with columns for
  * the statements in the haystack that are most similar to the needle
  * each statement's numerical similarity to the needle.
*   By default, it performs these tasks on a dataset we provide, but the intent is that users will upload their own pair of datasets.
*   It outputs a downloadable table of statements from the haystack and their similarities, starting with the most similar statement.


**This notebook performs the following tasks**


*   Takes a query and a database as input and performs semantic search using BM25Okapi and word-embedding models
*   Outputs the cosine similarity score of each database match with the query and ranks the matches in descending order of similarity.
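
The "score every statement, then rank" pattern described in the list above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: `rank_haystack` and `token_overlap` are hypothetical names, and the toy word-overlap similarity is only a stand-in for the BM25 and embedding-based scores the notebook actually uses.

```python
def rank_haystack(needle, haystack, similarity, top_k=None):
    """Score every haystack statement against the needle, best match first."""
    scored = [(statement, similarity(needle, statement)) for statement in haystack]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


def token_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase word sets.

    A stand-in for illustration only, not BM25 and not an embedding model.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy data for illustration
needle = "be patient and kind with newcomers"
haystack = [
    "will be patient.",
    "i prefer jira.",
    "please be kind with newcomers",
]

for statement, score in rank_haystack(needle, haystack, token_overlap):
    print(f"{score:.3f}  {statement}")
```

Because the similarity function is passed in as an argument, the same ranking step works unchanged when a stronger measure replaces the toy one.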

[SETH SUGGESTS]
### Going deeper
People with programming experience can modify this notebook, and learn about the method in the process. More details about semantic similarity, semantic search, BM25Okapi, transformer-based word-embeddings, and other parts of this notebook are in the [README file](https://github.com/BSAkash/IG-SRL/blob/Akash/README.md).

[XXX IS THIS URL BELOW THE INTENDED ONE?]

**More details about this Google Colab Notebook in the [README file](https://github.com/BSAkash/IG-SRL/blob/Akash/README.md) !!**

[SETH SUGGESTS]
# **Boilerplate code**
This code sets up the analysis. You don't have to understand it. Just run it and then scroll down.

Run the commands below for the necessary installations. To run, press ***ctrl+enter***.

[SETH SUGGESTS] The commands below install the components needed for the rest of the analysis to work. To run, press ***ctrl+enter*** or select ***Runtime*** from the menu above and then one of the ***Run*** options within it.

In [None]:
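# Fetch the project code and switch to the branch this notebook uses
!git clone https://github.com/BSAkash/IG-SRL
%cd IG-SRL
!git checkout Akash
# Install the dependencies, then import the search module from the repository
!pip install -r ./policy_explore/requirements.txt
from policy_explore import semantic_search
# Return to Colab's default working directory
%cd /content/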

Cloning into 'IG-SRL'...
remote: Enumerating objects: 245, done.
remote: Total 245 (delta 0), reused 0 (delta 0), pack-reused 245
Receiving objects: 100% (245/245), 1.01 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (117/117), done.
/content/IG-SRL
Branch 'Akash' set up to track remote branch 'Akash' from 'origin'.
Switched to a new branch 'Akash'
Collecting transformers (from -r ./policy_explore/requirements.txt (line 1))
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 28.5 MB/s eta 0:00:00
Collecting rank_bm25 (from -r ./policy_explore/requirements.txt (line 2))
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting sentence-transformers (from -r ./policy_explore/requirements.txt (line 3))
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m12.5 MB/s[0m eta [3

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


Downloading (…)33c52/.gitattributes:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)4cd7833c52/README.md:   0%|          | 0.00/3.79k [00:00<?, ?B/s]

Downloading (…)d7833c52/config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading (…)cd7833c52/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Downloading (…)cd7833c52/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)7833c52/modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

/content


# **Download the datasets**

[SETH SUGGESTS]
# **Data upload**
[Follow the guidance I provided for this section in the SRL notebook. Rather than wget, there should be an upload button, and a link to a sample dataset that users can download and then manually upload.  ]

In [None]:
## asymmetric search: query related emails from policies
# query policies
!wget https://storage.googleapis.com/cscw_2022/anamika_os.csv
# search base (just read 'last_reply')
!wget https://storage.googleapis.com/routines_semantic/srl_emails.csv

--2023-10-01 03:17:05--  https://storage.googleapis.com/cscw_2022/anamika_os.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.207, 108.177.119.207, 108.177.126.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47296 (46K) [text/csv]
Saving to: ‘anamika_os.csv’


2023-10-01 03:17:06 (7.74 MB/s) - ‘anamika_os.csv’ saved [47296/47296]

--2023-10-01 03:17:06--  https://storage.googleapis.com/routines_semantic/srl_emails.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.207, 108.177.119.207, 108.177.126.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1528193538 (1.4G) [text/csv]
Saving to: ‘srl_emails.csv’


2023-10-01 03:17:49 (33.7 MB/s) - ‘srl_emails.csv’ saved [1528193538/1528193538]



# **Dataset Cleaning**

[SETH SUGGESTS to remove this section.  This section seems to assume that the input to the notebook is emails.  That hasn't been said, and I don't think it's true.  Instead of having this section:
* Remove everything related to emails
* Add to the section above the intended inputs to this notebook.  What kind of data should people have to upload? You must tell them. Certainly we don't intend for them to publish findings on our email dataset, or even on other email datasets.
  * Also add to the above a description of any cleaning steps performed here that must be performed prior to upload
* replace the standard uploaded file with an already cleaned version.
  * ... unless there are cleaning steps that should always be performed. Leave those in.
  * The standard uploaded file should probably have a minimum of columns as well, or a clear specification of how users indicate which columns are the focus of the analysis.]

In [None]:
# Read the query policies (the "needles")
query = semantic_search.pd.read_csv("anamika_os.csv")
# Read only the needed columns of the search base (the "haystack")
search_base = semantic_search.pd.read_csv("srl_emails.csv", usecols=['project_name', 'month', 'date', 'message_id', 'reply'])
print(search_base.columns)
# Drop rows with missing values and renumber the index from 0
search_base = search_base.dropna().reset_index(drop=True)
search_base

Index(['project_name', 'month', 'date', 'message_id', 'reply'], dtype='object')


Unnamed: 0,project_name,month,date,message_id,reply
0,jena,16.0,2012-03-25 14:22:54,<4F6F1C2E.7000606@googlemail.com>,robert vesse wrote : i'd like to see faster re...
1,impala,6.0,2016-06-29 05:19:40,<201606290519.u5T5JeWS014957@ip-10-146-233-104...,bharath vissapragada has posted comments on th...
2,oozie,12.0,2012-07-16 18:30:36,<317574600.59506.1342463436256.JavaMail.jirato...,mona chitnis updated oozie - 907 :
3,hama,6.0,2008-11-12 20:57:55,<eb4706e0811120357x312a6d66ga5a18bedbeeb30a3@m...,"oh, problem is a block size. blockeddensematri..."
4,carbondata,4.0,2016-10-21 14:07:06,<20161021140706.22E7ADFDC4@git1-us-west.apache...,github user zhangshunyu closed the pull reques...
...,...,...,...,...,...
952790,iotdb,1.0,2018-12-20 16:26:51,<42cf87a3.2c890.167cab978c1.Coremail.x-y16@mai...,i prefer jira.
952791,cloudstack,6.0,2012-10-03 07:55:46,<9DDC9CE7-F7A6-421F-9A50-A887EF814B1D@basho.com>,has their been any consideration given to shif...
952792,netbeans,3.0,2017-01-01 16:08:28,<20170101160828.73540.19791@johns-mbp-2.home>,this email was sent by an automated system on ...
952793,qpid,4.0,2006-12-13 11:12:21,<6492006.1166037141581.JavaMail.jira@brutus>,gordon sim commented on qpid - 65 :


# **Lexical Search of the query on the Database & BM25Okapi Scores**

[SETH SUGGEST]
# **Query**
The code below performs a query on your uploaded database.

On our sample database (about 950K email replies, per the progress bars below), this code takes a little over 20 minutes to run.
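
For readers curious about what a BM25 score measures, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration under stated assumptions, not the code that `lex_search` runs: the function name `bm25_scores` is hypothetical, tokenization is plain whitespace splitting, the IDF uses a smoothed always-positive variant, and `k1=1.5` and `b=0.75` are commonly used defaults rather than values confirmed for this notebook.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms carry more weight than common ones (smoothed IDF)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy haystack for illustration
haystack = [
    "each podling must submit a report to the incubator pmc",
    "i prefer jira",
    "the vote passes with three binding votes",
]
needle = "each podling in incubation must report to the incubator pmc"

scores = bm25_scores(needle.split(), [s.split() for s in haystack])
ranked = sorted(zip(haystack, scores), key=lambda pair: pair[1], reverse=True)
for statement, score in ranked:
    print(f"{score:.3f}  {statement}")
```

A statement that shares no words with the query scores exactly zero, which is why a lexical search like this works as a first-pass filter before the slower embedding step.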

[SETH SUGGEST CODE CHANGE:]

In [None]:
query_text = query["policy.statement"][0]  # this is the first formal policy from the list that was input
# query_text = "Be patient and kind with newcomers"  # this is not a formal policy, but can also be searched for, perhaps as a probe for potential emergent social norms
print(query_text)
obj = semantic_search.semantic_search()
filtered_data = obj.lex_search(query_text, search_base)  # lex_search takes approx 21 mins to run and produce filtered_data
filtered_data

each podling in incubation must report to the incubator pmc.
Be patient and kind with newcomers


100%|██████████| 952795/952795 [22:56<00:00, 691.95it/s]


Unnamed: 0,corpus_id,score
7496,"thanks all for the kind words, really enjoying...",13.796782
7204,"getting a release out is the # 1 task here, he...",13.011201
849,will be patient.,12.442479
2172,1 ) no it's not a requirement that you be a co...,12.401604
1571,"this is true, thanks for stating that. ( i wou...",11.846572
...,...,...
8715,mayankshriv commented on a change in pull requ...,1.185136
2826,mayankshriv commented on a change in pull requ...,1.185136
11023,author : lischke date : fri jan 14 11 : 13 : 1...,1.180191
3054,"hi regina, imho besides of being incomplete, t...",1.180191


In [None]:
obj = semantic_search.semantic_search()
filtered_data = obj.lex_search(query["policy.statement"][0], search_base)  # lex_search takes approx 21 mins to run and produce filtered_data
filtered_data

100%|██████████| 952795/952795 [22:51<00:00, 694.48it/s]


Unnamed: 0,corpus_id,score
12664,"as far as the podling is concerned, a report n...",25.016027
99630,do you know if we should have submitted a podl...,22.620396
16130,the incubator pmc would appreciated if you cou...,21.366746
104969,the incubator pmc would appreciated if you cou...,21.366746
62015,the incubator pmc would appreciated if you cou...,21.366746
...,...,...
53537,i've directly pushed 1 commit of pom. xml to e...,0.499596
8082,beiwei30 closed pull request # 11 : update dub...,0.492551
98408,http : / / bugzilla. spamassassin. org / show ...,0.490821
75997,nzomkxia closed pull request # 10 : update dub...,0.489103


# **Sentence Embeddings**
[SETH SUGGESTS]
This code extracts a numerical representation of each sentence in the database. In this method, each sentence is represented by a vector of 768 numbers: a point in a 768-dimensional abstract semantic space. The magic of neural networks is that the coordinates they assign to sentences in this space have the property that sentences with similar coordinates have similar subjective meanings, even taking context into account. Once extracted, these coordinates can be used in clustering algorithms (to detect categories of statements), visualizations (to see patterns), and even regressions (to build rich null models).

On our sample database of 100K sentences, this code takes about 55 minutes to run.

In [None]:
data = filtered_data['corpus_id']
data_embed = obj.sentence_embeddings_encode(semantic_search.word_embedding_model, data)  # computing the sentence encodings takes approx 55 mins

# **Cosine Similarities**
[SETH SUGGESTS:]
This code computes the cosine similarity between the query and each statement in the database. Cosine similarity, which yields a number between -1 and 1 (higher means more similar), is a standard similarity measure for these applications. That is because more familiar measures like Euclidean distance start to become uninformative in high-dimensional spaces like neural embeddings.
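
As a small worked illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses three-dimensional toy vectors instead of real embeddings, and `cosine_similarity` is a hypothetical helper, not a function from the notebook's module. Note that the notebook's method is named `dot_score`: a plain dot product equals cosine similarity only when the vectors have unit length, and whether the embeddings are normalized is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors for illustration (real embeddings here have 768 dimensions)
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, different length
orthogonal = [0.0, 0.0, 3.0]       # nothing in common
opposite = [-1.0, -2.0, 0.0]       # points the opposite way

for name, vec in [("same direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(f"{name}: {cosine_similarity(query_vec, vec):.3f}")
```

The measure ignores vector length and looks only at direction, so the three cases land at roughly 1, 0, and -1 respectively.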

In [None]:
cos_similarities = obj.dot_score(query["policy.statement"][0],data_embed)

100%|██████████| 107142/107142 [00:10<00:00, 10021.59it/s]


# **Similarity Scores after Semantic Search**

[SETH SUGGEST: cut header "Similarity Scores after Semantic Search" and replace with this text instead:]
We use a different, more specialized similarity measure in the queries above (one called *Okapi BM25*), but we share cosine similarity here because it is a good workhorse and, overall, succeeds at reproducing BM25's ordering:

[SETH SUGGEST: is there a way to get the output of the similarity column below to say `0.8241` instead of `[[tensor(0.8241)]]`? I tried replacing `cos_similarities` with `float(cos_similarities[0])` but that produced very different values. I must not understand tensors.]
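
One possible way to address the question above is to unwrap each nested entry and convert it to a plain Python float before building the table. This is a sketch that assumes `cos_similarities` is a list with one entry per row, each shaped like `[[tensor(0.8241)]]`. The helper name `to_float` is hypothetical, and plain numbers stand in for PyTorch tensors so the example runs without PyTorch; `float()` also accepts a single-element tensor.

```python
def to_float(value):
    """Unwrap nested single-element lists or tuples and return a plain float."""
    while isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"expected exactly one element, got {len(value)}")
        value = value[0]
    # float() works on Python numbers and on single-element tensors or arrays
    return float(value)


# Stand-in for the notebook's nested scores (plain numbers instead of tensors)
nested_scores = [[[0.8241]], [[0.7675]], [[-0.0941]]]

flat_scores = [to_float(score) for score in nested_scores]
print(flat_scores)
```

In the cell below, this would amount to passing `[to_float(s) for s in cos_similarities]` as the `Cosine Similarity` column. Converting every entry, not just `cos_similarities[0]`, keeps one score per row.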


In [None]:
sorted_df = semantic_search.pd.DataFrame({'Data': data, 'Cosine Similarity': cos_similarities}).sort_values(by='Cosine Similarity', ascending=False)
sorted_df

Unnamed: 0,Data,Cosine Similarity
76011,xianxianya opened a new issue # 9108 : echarts...,[[tensor(0.8241)]]
71123,the vote passes with 3 + 1 votes ( including 3...,[[tensor(0.7675)]]
72356,i added some brief notes to the report for win...,[[tensor(0.7395)]]
101428,mbeckerle commented on a change in pull reques...,[[tensor(0.7364)]]
38085,forwarding messages - - - - - - - - date : 201...,[[tensor(0.7364)]]
...,...,...
100989,tvm is a community - driven project and we lov...,[[tensor(-0.0941)]]
71740,it's at : http : / / svn. apache. org / viewvc...,[[tensor(-0.1083)]]
6290,github user yonzhang opened a pull request : y...,[[tensor(-0.1163)]]
9702,"you might also send it to announce @, i believ...",[[tensor(-0.1256)]]


# **Convert results to csv and Download**

[SETH SUGGEST:]
# **Convert results to `.csv` and download**
[SETH SUGGEST: this code converts but doesn't download yet. pls finish it. also, make sure you export the version with BM25's. Don't include cosine similarities in the downloaded file unless you can get them to appear as numbers in the csv file.  Otherwise I'm concerned that the file, when uploaded to another program, won't import those values as numbers, which would eliminate the point. Please check.]
[SETH SUGGEST: what's that last code block doing?]

In [None]:
csv_file_path = "sorted_cosine_similarities.csv"
sorted_df.to_csv(csv_file_path, index=False)
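
The missing download step, and the concern about scores being stored as numbers, could be handled along these lines. This is a hedged sketch rather than the notebook's final code: `save_scores` and `download_in_colab` are hypothetical helpers, the column names and file name are placeholders, and the standard `csv` module stands in for pandas so the example is self-contained. `files.download` is the helper Colab provides in `google.colab` for sending a file to the browser.

```python
import csv
import os
import tempfile


def save_scores(path, statements, scores):
    """Write statements and plain-float scores to a CSV file, best match first."""
    rows = sorted(zip(statements, scores), key=lambda row: row[1], reverse=True)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["statement", "score"])  # placeholder column names
        for statement, score in rows:
            # float() ensures a bare number is written, not a tensor repr
            writer.writerow([statement, float(score)])
    return path


def download_in_colab(path):
    """Offer the file as a browser download when running inside Google Colab."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print(f"Not running in Colab; the file was saved locally as {path}")
        return False
    files.download(path)
    return True


# Toy rows for illustration; the scores are made up
out_path = os.path.join(tempfile.gettempdir(), "bm25_scores_example.csv")
save_scores(out_path, ["i prefer jira.", "will be patient."], [1.2, 12.4])
download_in_colab(out_path)
```

In the notebook itself, the same two steps would presumably be `filtered_data.to_csv("bm25_scores.csv", index=False)` followed by `files.download("bm25_scores.csv")`, with the file name again a placeholder.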

In [None]:
## main function to input arguments to select query

