## Objective

This notebook demonstrates how to use text embeddings from Vertex AI PaLM to index code files and retrieve them with natural-language queries.

## High Level Flow

Index Creation:
- Recursively list the code files (.ipynb, and optionally .py) in the GitHub repo
- Extract the code from each file
- Generate an embedding for each code string
- Add the embeddings to the vector store

Model Prompting:
- The user enters a prompt or asks a question
- Generate an embedding for the user prompt to capture its semantics
- Search the vector store (ScaNN) to retrieve the embeddings (relevant documents) nearest to the prompt
- Return the URLs of the matching files
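
The two phases above can be sketched end to end without any cloud services. This is only a toy illustration: `build_vocab`, `toy_embed`, `build_index`, and `search` are hypothetical helpers, the bag-of-words vectors stand in for real text embeddings, and the exhaustive scoring loop stands in for the vector store. The file names and texts are made up.

```python
import math

def build_vocab(texts):
    """Assign one vector dimension to each distinct token in the corpus."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_embed(text, vocab):
    """Stand-in for an embedding model: a unit-length bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents, vocab):
    """Index creation: one embedding per document, stored with its URL."""
    return [(url, toy_embed(text, vocab)) for url, text in documents.items()]

def search(index, prompt, vocab, k=1):
    """Model prompting: embed the prompt, score every document, return the top-k URLs."""
    query = toy_embed(prompt, vocab)
    scored = [(sum(q * d for q, d in zip(query, emb)), url) for url, emb in index]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]

# Made-up corpus, for illustration only
documents = {
    "repo/train.ipynb": "import tensorflow model fit training loop profiler",
    "repo/recommend.ipynb": "movie recommendation bandit agent reward",
    "repo/text.ipynb": "pytorch text classification tokenizer dataset",
}
vocab = build_vocab(documents.values())
index = build_index(documents, vocab)
top_urls = search(index, "pytorch text classification", vocab, k=1)
print(top_urls)
```

Only the third document shares tokens with the prompt, so it is the one returned. A real embedding model also matches on meaning rather than exact tokens, which this toy version cannot do.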



In [None]:
!pip install google-cloud-aiplatform --upgrade --user

!pip install scann

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
GITHUB_TOKEN = ""
GITHUB_REPO = "GoogleCloudPlatform/vertex-ai-samples"
PROJECT_ID = "hsbedi-docai"
LOCATION = "us-central1"

In [None]:
import requests
import time

# Crawls a GitHub repository and returns the URLs of all matching files in it
def crawl_github_repo(url, is_sub_dir, access_token=GITHUB_TOKEN):
    ignore_list = ['__init__.py']
    # Add '.py' to this tuple to index plain Python files as well
    extensions = ('.ipynb',)

    # Top-level calls pass "owner/repo"; recursive calls pass a full API URL
    if not is_sub_dir:
        api_url = f"https://api.github.com/repos/{url}/contents"
    else:
        api_url = url

    headers = {"Accept": "application/vnd.github.v3+json"}
    # Send the Authorization header only when a token is set
    if access_token:
        headers["Authorization"] = f"Bearer {access_token}"

    response = requests.get(api_url, headers=headers)
    response.raise_for_status()  # Check for any request errors

    files = []
    contents = response.json()

    for item in contents:
        if item['type'] == 'file' and item['name'] not in ignore_list and item['name'].endswith(extensions):
            files.append(item['html_url'])
        elif item['type'] == 'dir' and not item['name'].startswith("."):
            # Pass the token along so sub-directory requests use it too
            sub_files = crawl_github_repo(item['url'], True, access_token)
            time.sleep(.1)  # Brief pause between requests
            files.extend(sub_files)

    return files
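
The filtering step of the crawler can be tried offline. The sketch below assumes directory listings shaped like the items the crawler reads (`type`, `name`, `html_url`, `url`). `select_code_files` and `make_item` are hypothetical helpers, and the sample listing and its URLs are made up.

```python
def select_code_files(contents, extensions=(".ipynb", ".py"), ignore=("__init__.py",)):
    """Split one directory listing into matching file URLs and sub-directory API URLs."""
    file_urls, dir_urls = [], []
    for item in contents:
        name = item["name"]
        if item["type"] == "file" and name not in ignore and name.endswith(tuple(extensions)):
            file_urls.append(item["html_url"])
        elif item["type"] == "dir" and not name.startswith("."):
            dir_urls.append(item["url"])
    return file_urls, dir_urls

def make_item(kind, name):
    """Build a made-up listing entry with the four keys used above."""
    return {
        "type": kind,
        "name": name,
        "html_url": f"https://example.com/blob/main/{name}",
        "url": f"https://example.com/api/contents/{name}",
    }

sample = [
    make_item("file", "train.ipynb"),
    make_item("file", "__init__.py"),   # skipped: on the ignore list
    make_item("file", "README.md"),     # skipped: wrong extension
    make_item("file", "utils.py"),
    make_item("dir", ".github"),        # skipped: hidden directory
    make_item("dir", "notebooks"),
]
file_urls, dir_urls = select_code_files(sample)
print(file_urls)
print(dir_urls)
```

Keeping the filter separate from the HTTP calls makes it straightforward to change which extensions are indexed without touching the crawl loop.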

In [None]:
code_files_urls = crawl_github_repo(GITHUB_REPO,False,GITHUB_TOKEN)

# Write list to a file
with open('file.txt', 'w') as f:
    for item in code_files_urls:
        f.write(item + '\n')


len(code_files_urls)

In [None]:
# Authenticate with Google Cloud credentials for Google colab
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [None]:
# Initialize Vertex AI and load the text embedding model
from vertexai.preview.language_models import TextEmbeddingModel
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [None]:
with open('file.txt') as f:
    files = f.read().splitlines()
len(files)

449

In [None]:
import requests
import nbformat

# Converts a GitHub "blob" page URL to the matching raw-content URL
def to_raw_url(github_url):
    return github_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

# Extracts the Python code from an .ipynb file on GitHub
def extract_python_code_from_ipynb(github_url):
    response = requests.get(to_raw_url(github_url))
    response.raise_for_status()  # Check for any request errors

    notebook = nbformat.reads(response.text, as_version=nbformat.NO_CONVERT)

    # Join the source of every code cell; yields "" (not None) if there are no code cells
    return "\n".join(cell.source for cell in notebook.cells if cell.cell_type == "code")

# Fetches the contents of a .py file on GitHub
def extract_python_code_from_py(github_url):
    response = requests.get(to_raw_url(github_url))
    response.raise_for_status()  # Check for any request errors

    return response.text
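
The two steps inside the extraction functions, rewriting the URL and collecting code cells, can be checked without network access. The sketch below uses only the standard library and assumes a v4-format notebook. `code_from_notebook_json` is a hypothetical helper, and the sample notebook and the `owner/repo` URL are made up.

```python
import json

def to_raw_url(github_url):
    """Map a github.com "blob" URL to its raw.githubusercontent.com equivalent."""
    return github_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

def code_from_notebook_json(notebook_text):
    """Join the source of all code cells in a v4-format notebook given as JSON text."""
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # In the JSON file, "source" may be one string or a list of lines
            sources.append("".join(source) if isinstance(source, list) else source)
    return "\n".join(sources)

# Made-up notebook with one markdown cell and two code cells
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["import os\n", "print(os.getcwd())"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "x = 1"},
    ],
})

raw_url = to_raw_url("https://github.com/owner/repo/blob/main/nb.ipynb")
code = code_from_notebook_json(sample_notebook)
print(raw_url)
print(code)
```

Markdown cells are dropped, so only the code text would be passed on to the embedding step.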

In [None]:
code_strings = []

for url in code_files_urls:
    if url.endswith(".ipynb"):
        code_strings.append(extract_python_code_from_ipynb(url))
    else:
        code_strings.append(extract_python_code_from_py(url))

In [None]:
code_embeddings = []
BATCH_SIZE = 5

for batch_start in range(0, len(code_strings), BATCH_SIZE):
    # Clamp the end index so the last batch is reported correctly
    batch_end = min(batch_start + BATCH_SIZE, len(code_strings))
    batch = code_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")

    # Get the embedding vector for each string in the batch
    batch_embeddings = [emb.values for emb in embedding_model.get_embeddings(batch)]
    code_embeddings.extend(batch_embeddings)

Batch 0 to 4
Batch 5 to 9
Batch 10 to 14
Batch 15 to 19
Batch 20 to 24
Batch 25 to 29
Batch 30 to 34
Batch 35 to 39
Batch 40 to 44
Batch 45 to 49
Batch 50 to 54
Batch 55 to 59
Batch 60 to 64
Batch 65 to 69
Batch 70 to 74
Batch 75 to 79
Batch 80 to 84
Batch 85 to 89
Batch 90 to 94
Batch 95 to 99
Batch 100 to 104
Batch 105 to 109
Batch 110 to 114
Batch 115 to 119
Batch 120 to 124
Batch 125 to 129
Batch 130 to 134
Batch 135 to 139
Batch 140 to 144
Batch 145 to 149
Batch 150 to 154
Batch 155 to 159
Batch 160 to 164
Batch 165 to 169
Batch 170 to 174
Batch 175 to 179
Batch 180 to 184
Batch 185 to 189
Batch 190 to 194
Batch 195 to 199
Batch 200 to 204
Batch 205 to 209
Batch 210 to 214
Batch 215 to 219
Batch 220 to 224
Batch 225 to 229
Batch 230 to 234
Batch 235 to 239
Batch 240 to 244
Batch 245 to 249
Batch 250 to 254
Batch 255 to 259
Batch 260 to 264
Batch 265 to 269
Batch 270 to 274
Batch 275 to 279
Batch 280 to 284
Batch 285 to 289
Batch 290 to 294
Batch 295 to 299
Batch 300 to 304
Batch 3
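
The batching pattern used in the embedding loop can be isolated and tested with a stand-in for the model call. In this sketch, `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names. `fake_embed` only mimics the shape of an embedding call (one vector per input text) and is not the Vertex AI API.

```python
def iter_batches(items, batch_size):
    """Yield (start, end, batch) with an inclusive end index; the last batch may be shorter."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        yield start, start + len(batch) - 1, batch

def embed_in_batches(texts, embed_fn, batch_size=5):
    """Call embed_fn once per batch and collect one vector per input text."""
    vectors = []
    for start, end, batch in iter_batches(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError(
                f"Batch {start} to {end}: expected {len(batch)} vectors, got {len(batch_vectors)}"
            )
        vectors.extend(batch_vectors)
    return vectors

def fake_embed(batch):
    """Stand-in for the embedding call: returns a 2-d vector per text."""
    return [[float(len(text)), 1.0] for text in batch]

texts = [f"snippet {i}" for i in range(12)]
ranges = [(start, end) for start, end, _ in iter_batches(texts, 5)]
vectors = embed_in_batches(texts, fake_embed)
print(ranges)
print(len(vectors))
```

The length check guards against the embeddings drifting out of step with the file list, which matters because search results are later mapped back to files by position.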

In [None]:
len(code_embeddings)

449

# ScaNN Index Creation

In [None]:
# Change num_leaves and training_sample_size based on the size of your corpus
import timeit
import scann
import numpy as np

start = timeit.default_timer()
# Normalize each embedding to unit length so that dot-product ranking matches cosine-similarity ranking
embeddings = np.asarray(code_embeddings)
normalized_dataset = embeddings / np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product").tree(
    num_leaves=200, num_leaves_to_search=100, training_sample_size=1000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()
elapsed = timeit.default_timer() - start
# Display the elapsed time in seconds
elapsed

0.44209675900037837
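
The index is built on unit-length vectors with a dot-product metric. A small pure-Python check illustrates why that is a reasonable pairing: for a fixed query, the dot product against a normalized document vector equals the cosine similarity scaled by the query's length, so both give the same ranking. The vectors and helper names below are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

docs = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
query = [1.0, 2.0]

unit_docs = [normalize(d) for d in docs]
dot_scores = [dot(query, d) for d in unit_docs]   # what a dot-product index would score
cos_scores = [cosine(query, d) for d in docs]     # cosine similarity on the raw vectors

rank_by_dot = sorted(range(len(docs)), key=lambda i: dot_scores[i], reverse=True)
rank_by_cos = sorted(range(len(docs)), key=lambda i: cos_scores[i], reverse=True)
print(rank_by_dot, rank_by_cos)
```

Dividing each dot score by the query's length recovers the cosine value, so normalizing the query as well would only rescale the scores, not reorder the results.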

In [None]:
# Function to query the ScaNN index. Returns 5 neighbors by default
# Increasing the number of neighbors will slow the query down
import timeit
def search_posts(query_embedding, num_results=5):
    start = timeit.default_timer()

    # Change num_results to control how many documents are returned
    neighbors, distances = searcher.search(query_embedding, final_num_neighbors=num_results)
    elapsed = timeit.default_timer() - start
    return neighbors, distances, elapsed
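
Because the index search is approximate, one way to judge a parameter setting is to compare its results with an exhaustive search on a small sample. The sketch below is one possible check, not part of the original workflow. `exact_top_k` and `recall_at_k` are hypothetical helpers, and the `approx` list is a made-up stand-in for the neighbors an approximate searcher might return.

```python
import heapq

def exact_top_k(query, dataset, k):
    """Exhaustive dot-product search: indices of the k highest-scoring rows."""
    scores = [sum(q * d for q, d in zip(query, row)) for row in dataset]
    return heapq.nlargest(k, range(len(dataset)), key=scores.__getitem__)

def recall_at_k(approx_neighbors, exact_neighbors):
    """Fraction of the exact neighbors that the approximate search also returned."""
    exact = set(exact_neighbors)
    return len(exact & set(approx_neighbors)) / len(exact)

# Made-up unit vectors and query
dataset = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

exact = exact_top_k(query, dataset, k=2)
approx = [0, 2]  # pretend output of an approximate search (assumed, not real)
recall = recall_at_k(approx, exact)
print(exact, recall)
```

If recall on a sample of queries turns out low, searching more leaves or reordering more candidates are the usual knobs to try, at the cost of slower queries.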


In [None]:
# Sample queries for testing the ScaNN index
queries = ["model training profiling for tensorflow",
           "Recomender system",
           "pytorch sample for Text classification"]
query_embeddings = embedding_model.get_embeddings(queries)

In [None]:
for query, query_embedding in zip(queries, query_embeddings):
    print(f"Query: {query}")
    neighbors, distances, elapsed = search_posts(query_embedding.values, 5)
    print(neighbors)
    print(distances)

    for result_index in neighbors:
        print(files[result_index])

Query: model training profiling for tensorflow
[251 242 324 422   9]
[0.72467923 0.72459215 0.7197988  0.7195308  0.715413  ]
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/sdk/SDK_Explainable_AI_Custom_Tabular.ipynb
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/reduction_server/distributed-training-reduction-server.ipynb
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/custom/custom_training_tensorboard_profiler.ipynb
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tensorboard/tensorboard_profiler_custom_training.ipynb
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/tf_agents_bandits_movie_recommendation_with_kfp_and_vertex_sdk/step_by_step_sdk_tf_agents_bandits_movie_recommendation/step_by_step_sdk_tf_agents_bandits_movie_recommendation.ipynb
Query: Recomender system
[  9 448 434 385 388]
[0.63