# GitHub Pull Requests dataset
## The dataset
The dataset used for the base level knowledge augmentation can be found at
https://www.kaggle.com/datasets/pelmers/github-public-pull-request-comments.

It contains JSON records of pull requests (file paths, comments, and diffs, among other things) mined from permissively licensed public GitHub projects that had at least 25 stars and 25 pull requests at the time of access. It covers Go, Java, JavaScript, TypeScript, and Python.

## Data pipeline
The task of this notebook is to ingest each Pull Request in the dataset, embed it using a feature extraction model, and upload it to a vector database in order to enable Retrieval Augmented Generation (RAG).
Given the size of the overall dataset (over 30GB), this notebook will focus on the JavaScript portion of the dataset as a proof of concept.

### Data treatment
Some treatment of the data is necessary because:
- All the data sits in a single JSON file, which is a barrier to parallelizing the embedding and upload steps.
- Some repositories in the dataset have no data, even though they met the 25-star and 25-pull-request requirement to be mined. This happens for various reasons, including PRs that are unrelated to the code covered by the dataset.

So in this notebook the JavaScript subset is split into one JSON file per repository, in a way that facilitates future data processing pipelines over those files.
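
As a rough illustration of the split described above, the sketch below uses only the standard library on a tiny in-memory sample. The sample data, the `split_by_repo` helper, and the assumption that the data maps each repository name to a list of comment records are all hypothetical; the notebook itself works on a pandas dataframe.

```python
import json
import os
import tempfile

def split_by_repo(comments_by_repo: dict, out_dir: str) -> list:
    """Write one JSON file per non-empty repository, largest repository first."""
    # Drop repositories that carry no review comments at all
    non_empty = {repo: c for repo, c in comments_by_repo.items() if c}
    # Rank repositories by number of comments, in descending order
    ranked = sorted(non_empty.items(), key=lambda item: len(item[1]), reverse=True)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for rank, (repo, comments) in enumerate(ranked, start=1):
        # '/' cannot appear in a file name, so swap it for '@'
        file_name = f"{rank}-{repo.replace('/', '@')}.json"
        with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as fh:
            json.dump(comments, fh)
        written.append(file_name)
    return written

# Hypothetical sample: two repositories with comments, one without
sample = {
    "owner-a/small": [{"path": "a.js", "diff_hunk": "@@ -1 +1 @@"}],
    "owner-b/empty": [],
    "owner-c/large": [{"path": "b.js", "diff_hunk": "@@ -2 +2 @@"},
                      {"path": "c.js", "diff_hunk": "@@ -3 +3 @@"}],
}

with tempfile.TemporaryDirectory() as tmp:
    files = split_by_repo(sample, tmp)
    print(files)
```

The rank prefix in each file name records the ordering from the largest to the smallest repository, so later steps can process the biggest files first.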

In [1]:
from tqdm.auto import tqdm
import pandas as pd

df = pd.read_json('dataset/mined-comments-25stars-25prs-JavaScript.json/mined-comments-25stars-25prs-JavaScript.json', orient='index')

In [2]:
# checking how many rows it had before
df.shape

(28835, 9224)

In [3]:
# remove any row that has no data
df = df.dropna(axis=0, how='all')

In [4]:
# checking how many rows it had after
df.shape

(7366, 9224)

In [5]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9214,9215,9216,9217,9218,9219,9220,9221,9222,9223
trekhleb/javascript-algorithms,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,{'html_url': 'https://github.com/trekhleb/java...,...,,,,,,,,,,
airbnb/javascript,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,{'html_url': 'https://github.com/airbnb/javasc...,...,,,,,,,,,,
twbs/bootstrap,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,{'html_url': 'https://github.com/twbs/bootstra...,...,,,,,,,,,,
30-seconds/30-seconds-of-code,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,{'html_url': 'https://github.com/30-seconds/30...,...,,,,,,,,,,
d3/d3,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,{'html_url': 'https://github.com/d3/d3/pull/10...,...,,,,,,,,,,


In [6]:
df.index

Index(['trekhleb/javascript-algorithms', 'airbnb/javascript', 'twbs/bootstrap',
       '30-seconds/30-seconds-of-code', 'd3/d3', 'facebook/react',
       'facebook/react-native', 'facebook/create-react-app', 'axios/axios',
       'vercel/next.js',
       ...
       'ezsystems/ezoe', 'mac-/hapi-statsd', 'twinlabs/forum',
       'strongloop/loopback-sdk-angular-cli', 'bem/bem-mvc',
       'manguezal/manguezal.github.com', 'Ember-SC/peepcode-ordr-test',
       'apache/cordova-coho', 'papandreou/node-jpegtran',
       'meetup/meetup-web-components'],
      dtype='object', length=7366)

In [7]:
df.columns

RangeIndex(start=0, stop=9224, step=1)

In [8]:
# saving each row of the df to a dataframe
dfs = []
for i in tqdm(range(len(df)), 'Separating data by repository'):
    dfs.append(pd.DataFrame(df.iloc[i]).T)

Separating data by repository:   0%|          | 0/7366 [00:00<?, ?it/s]

In [9]:
# deleting the df variable to save memory
del df

In [10]:
# dropping cells with empty data in each dataframe
for i in tqdm(range(len(dfs)), 'Removing empty cells'):
    dfs[i] = dfs[i].dropna(axis='columns', how='all')

Removing empty cells:   0%|          | 0/7366 [00:00<?, ?it/s]

In [11]:
# ordering the dataframes by the number of PRs of each repository in descending order
dfs = sorted(dfs, key=lambda x: len(x.columns), reverse=True)

In [12]:
dfs[len(dfs) - 1]

Unnamed: 0,0
manguezal/manguezal.github.com,{'html_url': 'https://github.com/manguezal/man...


In [13]:
dfs[0][0][0]

{'html_url': 'https://github.com/plotly/plotly.js/pull/1#discussion_r44735997',
 'path': 'devtools/test_dashboard/server.js',
 'line': 36,
 'body': 'the test dashboard server script uses the same watch-bundling machinery as `npm run watch` \n:palm_tree: :palm_tree: \n',
 'user': 'etpinard',
 'diff_hunk': "@@ -1,89 +1,53 @@\n-var http = require('http');\n-var ecstatic = require('ecstatic');\n-var browserify = require('browserify');\n-var open = require('open');\n var fs = require('fs');\n-var watchify = require('watchify');\n+var http = require('http');\n var path = require('path');\n-var outpipe = require('outpipe');\n-var outfile = path.join(__dirname, '../shelly/plotlyjs/static/plotlyjs/build/plotlyjs-bundle.js');\n-\n-var testFile = './test';\n-\n-switch(process.argv[2]) {\n-  case 'geo':\n-    testFile = './test-geo';\n-  break;\n-  case '2d':\n-    testFile = './test-2d';\n-  break;\n-}\n-\n-console.log('using ' + testFile);\n-\n-var b = browserify(path.join(__dirname, '../shelly/

In [14]:
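# replacing the / in each repository name with @ so the name can be used as a file name
for i in tqdm(range(len(dfs)), 'Replace every / in the index with a @'):
    dfs[i].index = dfs[i].index.str.replace('/', '@', regex=False)

Replace every / in the index with a @:   0%|          | 0/7366 [00:00<?, ?it/s]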

In [15]:
# create the repo-split folder if it does not exist yet
import os

output_dir = 'dataset/mined-comments-25stars-25prs-JavaScript.json/repo-split'
os.makedirs(output_dir, exist_ok=True)

# save each df to '<rank>-<repository>.json'; rank 1 is the dataframe with the most columns
for i in tqdm(range(len(dfs)), 'Saving dataframes to json files'):
    dfs[i].to_json(f'{output_dir}/{i+1}-{dfs[i].index[0]}.json')

Saving dataframes to json files:   0%|          | 0/7366 [00:00<?, ?it/s]

In [16]:
# deleting the dfs variable to save memory
del dfs

### Data embedding and vector uploading
- Create a queue of the file names.
- Spawn 10 threads.
- Each thread takes a file from the queue and builds the proper data structure/dataframe from the JSON file.
- It sends the data to the inference API, or runs the transformer locally, to get the embeddings.
- It takes the resulting embeddings, builds the proper data structure to upload to a vector database (Pinecone), and uploads it in chunks of 100 vectors, as per the Pinecone documentation.
- When it finishes a file, it saves an entry to a log.
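
The flow above can be sketched with the standard library alone. This is a minimal sketch, not the notebook's implementation: `fake_embed`, `chunked`, `run_pipeline`, and the in-memory `uploaded` list are hypothetical stand-ins for the embedding call, the chunking helper, the driver, and the vector database.

```python
import queue
import threading

def fake_embed(text: str) -> list:
    """Placeholder for the embedding call; returns a tiny made-up vector."""
    return [float(len(text))]

def chunked(items: list, size: int) -> list:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_pipeline(files: dict, num_threads: int = 3, chunk_size: int = 2) -> list:
    """Process every file with a pool of worker threads and return the upload log."""
    file_queue = queue.Queue()
    uploaded = []            # stands in for the vector database
    lock = threading.Lock()  # protects the shared list

    def worker():
        while True:
            name = file_queue.get()
            if name is None:  # sentinel: no more work for this thread
                file_queue.task_done()
                break
            try:
                vectors = [(f"{name}-{i}", fake_embed(t)) for i, t in enumerate(files[name])]
                for chunk in chunked(vectors, chunk_size):
                    with lock:
                        uploaded.append((name, len(chunk)))
            finally:
                # Always mark the item as done so join() cannot hang
                file_queue.task_done()

    for name in files:
        file_queue.put(name)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for _ in threads:
        file_queue.put(None)  # one sentinel per worker
    file_queue.join()
    for t in threads:
        t.join()
    return uploaded

log = run_pipeline({"repo-a": ["x", "yy", "zzz"], "repo-b": ["w"]})
print(sorted(log))
```

Each worker pulls file names until it receives a sentinel, so the number of threads can be tuned independently of the number of files.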

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')

In [2]:
import pinecone

pinecone.init(
    api_key=str(PINECONE_API_KEY),
    environment='gcp-starter'
)



#### In case we need to delete and recreate the index to start fresh

In [3]:
# pinecone.delete_index('review-owl')

In [4]:
# pinecone.create_index('review-owl', dimension=1024, metric='euclidean', pods=1, pod_type='starter')

In [5]:
PINECONE_POOL_THREADS = 30
index = pinecone.Index('review-owl', pool_threads=PINECONE_POOL_THREADS)

In [6]:
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [7]:
# importing a specific repo dataset by finding all the files starting with the index number and a dash
import glob
import os

# getting the index number from the file name
def get_index_number(file):
    """A helper function to get the index number from the file name."""
    # take the file name via os.path.basename instead of splitting on '\\', which only works on Windows
    return int(os.path.basename(file).split('-')[0])

def import_filepaths(folder_path: str):
    # getting all the files in the directory
    file_paths = glob.glob(folder_path)

    # sorting the files by the index number
    sorted_filepaths = sorted(file_paths, key=get_index_number)

    return sorted_filepaths
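
A short sketch of why the files are sorted by their numeric prefix rather than by name. The paths and the `rank_of` helper are hypothetical examples that follow the `<rank>-<repository>.json` naming used when the files were written.

```python
import os

def rank_of(file_path: str) -> int:
    """Extract the leading rank from a file name such as '12-owner@repo.json'."""
    return int(os.path.basename(file_path).split('-')[0])

paths = ['repo-split/10-c@c.json', 'repo-split/2-b@b.json', 'repo-split/1-a@a.json']

# Plain string sorting puts '10-' before '2-'
print(sorted(paths))
# Sorting on the integer rank gives 1, 2, 10
print(sorted(paths, key=rank_of))
```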

In [8]:
from tqdm.auto import tqdm
import pandas as pd
import json
import requests
import time
import random
import itertools

# API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-base-en-v1.5"
API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5"
# API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5"
# API_URL = "https://api-inference.huggingface.co/models/BAAI/llm-embedder"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_KEY}"}

def file_to_embedding_inputs(file_path: str):
    embedding_input = []
    
    data = pd.read_json(file_path, orient='index')
    for row in tqdm(data.iloc[:, 0], 'Splitting file into payloads', position=2):
        embedding_input.append('path: ' + row['path'] + '\n' + 'diff_hunk: ' + row['diff_hunk'])
    
    return embedding_input

def huggingface_inference_api_request(payload: list[str]):
    """POST the payload to the Inference API and return the decoded JSON response."""
    data = json.dumps(payload)
    response = requests.post(API_URL, headers=headers, data=data)

    # while the model is still loading, wait and try again
    while (response.status_code != 200 and 'is currently loading' in response.text):
        print(response.text)
        time.sleep(20)
        response = requests.post(API_URL, headers=headers, data=data)

    # raise requests.HTTPError on any other failure so the caller's backoff logic can react,
    # instead of returning the exception object as if it were an embedding
    response.raise_for_status()
    return response.json()

def make_huggingface_request_with_backoff(payload: list[str]):
    max_retries = 10
    retries = 0

    while retries < max_retries:
        try:
            response = huggingface_inference_api_request(payload)
            return response
        except requests.HTTPError as e:
            print('HTTP error: ' + str(e.response.status_code))
            if e.response.status_code == 503:
                # 503 Service Unavailable: back off and retry
                wait_time = (2 ** retries) + (random.uniform(0, 1) * 0.1)  # Exponential backoff with random jitter
                time.sleep(wait_time)
                retries += 1
            else:
                print('Error: ' + str(e))
                # If it's not a 503 error, re-raise the exception
                raise e
        except Exception as e:
            # Handle other exceptions
            retries += 1

    # If all retries fail
    raise Exception("Max retries exceeded")
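
The retry pattern used above, exponential backoff with random jitter, can be isolated in a small generic helper. This is a sketch under assumptions: `retry_with_backoff`, its parameters, and the `flaky` operation are hypothetical, and the `sleep` argument exists only so the example can record the waits instead of actually sleeping.

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds, waiting base_delay * 2**attempt plus jitter between tries."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # The delay doubles on every attempt; the jitter spreads out concurrent clients
            wait_time = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(wait_time)

# Hypothetical operation that fails twice before succeeding
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporarily unavailable")
    return "ok"

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, [round(w) for w in waits])
```

With many threads hitting the same API, the jitter keeps retries from lining up on the same instant.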

In [9]:
def split_into_chunks(iterable: list, batch_size=100):
    """A helper function to break an iterable into chunks of size batch_size."""
    it = iter(iterable)
    chunks = []
    for i in range(0, len(iterable), batch_size):
        chunks.append(list(itertools.islice(it, batch_size)))

    return chunks

def pinecone_upsert_request(index, payload):
    """Upsert one chunk of (id, values, metadata) tuples into the given Pinecone index."""
    # call the index instead of posting to the Hugging Face API_URL,
    # and let exceptions propagate so the caller's retry logic can handle them
    return index.upsert(vectors=payload)

def make_pinecone_upsert_with_backoff(index, payload):
    max_retries = 3
    retries = 0

    while retries < max_retries:
        try:
            pinecone_responses = pinecone_upsert_request(index, payload)
            return pinecone_responses
        except requests.HTTPError as e:
            if e.response.status_code == 503:
                # Handle rate-limiting error
                wait_time = (2 ** retries) + (random.uniform(0, 1) * 0.1)  # Exponential backoff with random jitter
                time.sleep(wait_time)
                retries += 1
            else:
                # If it's not a 503 error, re-raise the exception
                raise e
        except Exception as e:
            # Handle other exceptions
            retries += 1

    # If all retries fail
    raise Exception("Max retries exceeded")
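
To illustrate the chunking step, here is a generator-based variant of the same idea. The `iter_chunks` name is hypothetical; unlike `split_into_chunks`, it does not need `len()`, so it also accepts iterators.

```python
import itertools

def iter_chunks(iterable, batch_size=100):
    """Yield successive lists of at most batch_size items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            return  # the iterator is exhausted
        yield chunk

# 250 items split into batches of 100: the last batch holds the remainder
sizes = [len(chunk) for chunk in iter_chunks(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```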

In [10]:
import threading
import queue
import hashlib

def hash_id(id):
    return hashlib.sha256(id.encode('utf-8')).hexdigest()

def embed_file(file_path: str):
    vectors_list = []
    embedding_inputs_list = file_to_embedding_inputs(file_path)

    for payload in tqdm(embedding_inputs_list, 'Generating vector embeddings', position=1):
        vectors_list.append(make_huggingface_request_with_backoff(payload))
    
    return vectors_list

def generate_metadata(file_path: str):
    metadata_list = []
    data = pd.read_json(file_path, orient='index')

    for row in tqdm(data.iloc[:, 0], 'Generating metadata', position=2):
        metadata_list.append({'repo': data.columns[0],'path': row['path'], 'diff': row['diff_hunk'], 'body': row['body']})
    
    return metadata_list

def generate_vector_ids(metadata_list: list[dict]):
    ids = []
    for i in tqdm(range(len(metadata_list)), 'Generating vector ids', position=2):
        ids.append(metadata_list[i]['repo'] + '-' + hash_id(str(i)))
    return ids

def generate_upsert_data(file_path: str):
    embedding_result = embed_file(file_path)
    metadata_list = generate_metadata(file_path)
    vector_ids = generate_vector_ids(metadata_list)

    upsert_data = list(zip(vector_ids, embedding_result, metadata_list))

    return upsert_data

def worker(file_queue: queue.Queue):
    while True:
        file_path = file_queue.get()
        if file_path is None:
            break

        try:
            upsert_data = generate_upsert_data(file_path)

            upsert_data_chunks = split_into_chunks(upsert_data)

            for chunk in tqdm(upsert_data_chunks, 'Uploading data to Pinecone', position=1):
                index.upsert(chunk)
            print(f"Logging: {file_path} processed")
        except Exception as e:
            print(f"Logging: {file_path} failed: {e}")
        finally:
            # always mark the task as done, otherwise file_queue.join() hangs after a failure
            file_queue.task_done()

# Function to create and manage worker threads
def create_worker_threads(num_threads: int, file_queue: queue.Queue):
    threads: list[threading.Thread] = []

    for _ in range(num_threads):
        thread = threading.Thread(target=worker, args=(file_queue,))
        thread.start()
        threads.append(thread)

    return threads

# Main function to process files using multiple threads
def process_files_with_threads(file_names: list[str], num_threads: int, start: int = 0, end=None):
    # Create a thread-safe queue
    file_queue = queue.Queue()

    # Populate the queue with file names
    for file_name in file_names[start:end]:
        file_queue.put(file_name)

    # Create worker threads
    threads = create_worker_threads(num_threads, file_queue)

    # Wait for all file processing to be completed
    file_queue.join()

    # Stop the worker threads
    for _ in range(num_threads):
        file_queue.put(None)

    for thread in threads:
        thread.join()

    print("All files processed.")
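
The sketch below shows how the three parallel lists (ids, embeddings, metadata) line up into upsert records. The repository name, the 2-dimensional vectors, and the `make_vector_id` helper are made up for illustration; the ID scheme mirrors `generate_vector_ids`, which hashes the position of each record within its file.

```python
import hashlib

def make_vector_id(repo: str, position: int) -> str:
    """Build a deterministic ID from the repository name and the record position."""
    digest = hashlib.sha256(str(position).encode('utf-8')).hexdigest()
    return f"{repo}-{digest}"

repo = "owner@repo"                     # hypothetical repository name
embeddings = [[0.1, 0.2], [0.3, 0.4]]   # made-up vectors
metadata = [{"path": "a.js"}, {"path": "b.js"}]

ids = [make_vector_id(repo, i) for i in range(len(embeddings))]
# Each record is an (id, values, metadata) tuple
records = list(zip(ids, embeddings, metadata))

print(len(records), len(set(ids)))
```

Because the IDs are deterministic, re-running the upload for the same file writes to the same IDs instead of creating new records.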

### Embedding models and processing

In [11]:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# # alternative embedding model to consider
# # model = SentenceTransformer('BAAI/llm-embedder')

In [12]:
file_paths = import_filepaths('dataset/mined-comments-25stars-25prs-JavaScript.json/repo-split/*.json')
process_files_with_threads(file_names=file_paths, num_threads=PINECONE_POOL_THREADS)

Splitting file into payloads:   0%|          | 0/8230 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7108 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7202 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8545 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7324 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7695 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/6430 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8576 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7373 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/5932 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/6931 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7429 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8146 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/9083 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/9224 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7237 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/6894 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7091 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8032 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/6224 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7098 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8037 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7874 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8859 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7221 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7454 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7083 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/7524 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/8081 [00:00<?, ?it/s]

Splitting file into payloads:   0%|          | 0/6813 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7202 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8081 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/9224 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8230 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/6894 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8146 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7524 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/6931 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/5932 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7874 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7373 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7221 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8576 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7429 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/6430 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7108 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7237 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8032 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8859 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8037 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/8545 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7695 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/9083 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7324 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7454 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7098 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7091 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/6224 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/7083 [00:00<?, ?it/s]

Generating vector embeddings:   0%|          | 0/6813 [00:00<?, ?it/s]

In [None]:
index.describe_index_stats()