# CodeClarity Tutorial - Simple Python Code Search Example
In this notebook we walk through a simple example of how to use codeclarity to generate vectors of source
code and natural language, and then use those vectors to build a simple retrieval application that powers a very simple search engine.

## Through the course of this notebook, we will cover:
1. The basics of semantic search
2. The applications of dense vectors for information retrieval
3. A model that was fine-tuned for code search over Python source code with English queries
4. Implementing a simple search of an example large repository


## Goal : semantic search of code 

To quote from Nils Reimers' project [sentence transformers](https://www.sbert.net/examples/applications/semantic-search/README.html), which is an excellent resource for learning about dense vectors and their uses in NLP at large:

"*Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms.*"


This means that semantic search differs from its predecessor, lexical search. Instead of looking for specific patterns that match in a retrieval object, as keyword matching does, we use machine learning models to capture a deeper contextual meaning of objects, which we can then map to a textual description: namely, a query.

This idea has been adopted with great success by major search engines in recent years, and has the benefit of being able to power search over arbitrary sequences of symbols, be they [text](https://www.sbert.net/docs/pretrained_models.html), [images](https://openai.com/blog/clip/) or [audio](https://arxiv.org/abs/1904.05073).
 


## Applications of dense vectors
### Sparse VS Dense Vectors
Before deep learning models became prevalent in natural language processing, and before the release of [BERT](https://arxiv.org/abs/1706.03762) and [Word2Vec variants](https://arxiv.org/abs/1301.3781), a common mathematical representation of textual features was [TF-IDF](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3): the term frequency-inverse document frequency of a sequence.

We won't explain it in detail here, but TF-IDF is an example of a sparse vector representation: there are as many elements in the array storing the representation as there are unique words in the training corpus. As a result, the vector for a given piece of text, especially one with a small vocabulary, will consist mostly of zeros.

This contrasts with, for example, an [S-BERT sentence embedding](https://www.sbert.net/examples/applications/computing-embeddings/README.html), where an input string of any length is mapped to a fixed-length array whose elements are mostly nonzero floating-point values.

### Using dense vectors for information retrieval
Although the details of training a dense retrieval model to generate embeddings are outside the scope of this tutorial, we will note that when fine-tuning a neural retrieval model, we directly compare queries to candidate items so that the embedding of each is mapped into a shared metric space. This vector space can then be exploited to compute a similarity score between a given query and every item in a search index.

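A common similarity measure for comparing embeddings at search time is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), the formula for which is below:

$$\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$

Here $q$ is the embedding of the query and $d$ is the embedding of an item in the search index.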

This outputs a similarity score between -1 and 1, where a higher score denotes a better fit of an item in the search index to the query. A simple search implementation, like the one in this tutorial, may simply calculate the cosine similarity (or the inner product, which gives the same ranking when the embeddings are normalized to unit length) and sort the items in the index to find the 'nearest neighbours' of the query.
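The scoring and sorting step can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not the method codeclarity uses internally: the function names, the three-dimensional toy vectors, and the choice to score zero vectors as 0.0 are all invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, index, k=3):
    """Return the k best (item_position, score) pairs, highest score first."""
    scored = [(cosine_similarity(query, vec), i) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy index of four 3-dimensional "embeddings" (real ones have hundreds of dimensions).
index = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
query = [1.0, 0.1, 0.0]
top = nearest_neighbours(query, index, k=2)
print(top)
```

Scoring every item is linear in the size of the index, which is fine for a few thousand functions; larger indexes usually call for an approximate nearest-neighbour library.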

## Models used
### UniXcoder and Code Understanding Transformers
We keep this section short, given the depth of this subdomain. In brief, fine-tuning a neural model for a given task will often fall short of state-of-the-art performance if the foundation model's pretraining data differs significantly from the data of the downstream task.

For example, a model pretrained on an English corpus, say Bert-Large-Uncased, will likely show decreased performance on a text classification task in Spanish. Similarly, source code and natural language unsurprisingly have stark differences in vocabulary, lexical structure and syntax. Thus, we start with a foundation model that has been shown to have a strong understanding of both English and multiple programming languages as a result of its pretraining scheme and data: [UniXcoder](THIS NEEDS A LINK)

Because it was trained on both source code and text, it is highly suitable for code search tasks, as shown by its strong performance on prominent code search benchmarks like [CodeSearchNet](THIS NEEDS A LINK). Although the authors have released the pretrained models, setting up the full inference pipeline can prove tricky. This is where codeclarity comes in!

## Implementing a simple search of an example large repository
### Data, Stack and General Workflow 
To illustrate the use of a code search model, we build a code understanding tool using an open-source Python repository of reasonable size: the AWS Python SDK. This repository contains well over 1000 functions, and a new contributor would need considerable onboarding to find where specific functionality is implemented.

Thus, we will use codeclarity to implement a very simple code search tool which:
1. Embeds all of the code snippets in the repository as dense vectors
2. Builds an index of these vectors that can be searched for a query's nearest neighbours
3. Uses the indices of these neighbours to retrieve each function snippet and its URL on GitHub
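Step 3 can be sketched independently of the embedding model. The snippet below assumes each function is stored as a record with `name`, `filepath` and `code` fields; the helper `build_results` and the base URL `REPO_BLOB_URL` are hypothetical placeholders, so substitute the real repository and branch before using it.

```python
from urllib.parse import quote

# Hypothetical base URL, used only for illustration.
REPO_BLOB_URL = "https://github.com/example-org/example-repo/blob/main"

def build_results(neighbour_indices, rows, base_url=REPO_BLOB_URL):
    """Map nearest-neighbour positions back to snippets and their GitHub URLs."""
    results = []
    for idx in neighbour_indices:
        row = rows[idx]
        results.append({
            "name": row["name"],
            "code": row["code"],
            # quote() keeps "/" intact and escapes characters unsafe in a URL path.
            "url": f"{base_url}/{quote(row['filepath'])}",
        })
    return results

# Two toy records, invented for this example.
rows = [
    {"name": "read", "filepath": "setup.py", "code": "def read(fname): ..."},
    {"name": "setup", "filepath": "doc/conf.py", "code": "def setup(app): ..."},
]

# Suppose the search returned row 1 as the best match, followed by row 0.
results = build_results([1, 0], rows)
for result in results:
    print(result["name"], result["url"])
```

Keeping the neighbour order intact means the results list is already ranked from best to worst match.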

Let's get started! We will be using the pretrained checkpoints for UniXcoder-base: a code understanding model that, in addition to being able to model our queries, understands Python, JavaScript, Java, Go and Ruby.

In [1]:
#Install the master branch of codeclarity
!pip install git+https://github.com/DocumaticAI/CodeClarity


In [2]:
#Imports 
import codeclarity
import pandas as pd 
#import faiss
import numpy as np
import pathlib

In [3]:
df = pd.read_csv("aws_python_sdk.csv").drop(columns=["Unnamed: 0"])
df.head(5)

| | code | language | name | filepath |
|---|---|---|---|---|
| 0 | `def read(fname):\n """\n Args:\n ...` | python | read | setup.py |
| 1 | `def read_version():\n return read("VERSION"...` | python | read_version | setup.py |
| 2 | `def read_requirements(filename):\n """Reads...` | python | read_requirements | setup.py |
| 3 | `def setup(app):\n sys.stdout.write("Generat...` | python | setup | doc/conf.py |
| 4 | `def get_jumpstart_sdk_manifest():\n url = "...` | python | get_jumpstart_sdk_manifest | doc/doc_utils/jumpstart_doc_utils.py |


In [4]:
from codeclarity import CodeEmbedder
embedding_model = CodeEmbedder(base_model= "microsoft/unixcoder-base")

Search retrieval model for allowed_languages ['java', 'ruby', 'python', 'php', 'javascript', 'go'] loaded correctly to device cuda in 1.9021415710449219 seconds


In [5]:
df['embeddings'] = df['code'].map(lambda x: embedding_model.encode(x, language= "python",))

32it [00:00, 32.37it/s]              

KeyboardInterrupt: 