# GitHub Pull Requests dataset
## The dataset
The dataset used for base-level knowledge augmentation is available at
https://www.kaggle.com/datasets/pelmers/github-public-pull-request-comments.

It contains JSON records of pull requests (file paths, comments, and diffs, among other fields) mined from permissively licensed public GitHub projects that had at least 25 stars and 25 pull requests at the time of access. It covers Go, Java, JavaScript, TypeScript, and Python.

## Data pipeline
The task of this notebook is to ingest each Pull Request in the dataset, embed it using a feature extraction model, and upload it to a vector database in order to enable Retrieval Augmented Generation (RAG).
Given the size of the overall dataset (over 30 GB), this notebook focuses on the JavaScript portion as a proof of concept.
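
To make the ingest, embed, upload, and retrieve flow concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real feature extraction model, `ToyVectorStore` is an in-memory stand-in for a vector database, and the two PR records are invented. It is not the model or database used by this project.

```python
import hashlib
import math

DIM = 1024  # arbitrary toy dimensionality (assumption)


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalised. Stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical interface)."""

    def __init__(self):
        self._items = []  # list of (item_id, vector, payload)

    def upsert(self, item_id, vector, payload):
        self._items.append((item_id, vector, payload))

    def query(self, vector, top_k=1):
        # vectors are unit length, so the dot product is the cosine similarity
        scored = [
            (sum(a * b for a, b in zip(vector, stored)), item_id, payload)
            for item_id, stored, payload in self._items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


# invented PR records, only for illustration
prs = [
    {"id": "owner/repo#1", "text": "fix null pointer check in parser"},
    {"id": "owner/repo#2", "text": "add dark mode toggle to settings page"},
]

store = ToyVectorStore()
for pr in prs:
    store.upsert(pr["id"], toy_embed(pr["text"]), pr)

# retrieval step of RAG: find the stored PR closest to a query
best = store.query(toy_embed("parser null check"), top_k=1)[0]
```

In a real pipeline the retrieved payload would be added to the prompt of a generative model; the sketch stops at retrieval.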

### Data treatment
Some treatment of the data is necessary because:
- All the data is in a single JSON file, which is a barrier to parallelizing the embedding upload.
- Some repositories in the dataset have no data even though they met the 25-star and 25-pull-request mining requirement, for various reasons, including PRs unrelated to the code covered by the dataset.

So in this notebook the subdataset is split into one JSON file per repository, in a way that facilitates future processing pipelines over those files.
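
As a rough picture of what that split does, the following sketch performs the same steps (drop empty repositories, rank by number of PRs, write one file per repository) using only the standard library. It assumes the combined file maps each repository name to a mapping of PR slots, which is what `orient='index'` implies, and the toy data is invented. The file contents it writes are simplified and do not match the layout `DataFrame.to_json` produces in the cells below.

```python
import json
import tempfile
from pathlib import Path


def split_by_repo(combined, out_dir):
    """Write one JSON file per non-empty repository and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # keep only repositories with at least one non-null PR record
    non_empty = {}
    for repo, prs in combined.items():
        kept = {slot: record for slot, record in prs.items() if record is not None}
        if kept:
            non_empty[repo] = kept

    # rank repositories from most to fewest PRs
    ranked = sorted(non_empty.items(), key=lambda item: len(item[1]), reverse=True)

    written = []
    for rank, (repo, prs) in enumerate(ranked, start=1):
        # '/' is a path separator, so it is replaced in the file name
        path = out_dir / f"{rank}-{repo.replace('/', '@')}.json"
        path.write_text(json.dumps({repo: prs}), encoding="utf-8")
        written.append(path)
    return written


# invented data standing in for the combined JSON file
toy = {
    "alice/widgets": {"0": {"comment": "nit: rename"}, "1": {"comment": "LGTM"}},
    "bob/empty": {"0": None},
    "carol/tools": {"0": {"comment": "add test"}},
}

with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in split_by_repo(toy, tmp)]
```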

In [None]:
from tqdm.auto import tqdm
import pandas as pd

df = pd.read_json('dataset/mined-comments-25stars-25prs-JavaScript.json/mined-comments-25stars-25prs-JavaScript.json', orient='index')

In [None]:
# checking how many rows there are before dropping empty ones
df.shape

In [None]:
# remove any row (repository) that has no data
df = df.dropna(axis=0, how='all')

In [None]:
# checking how many rows it had after
df.shape

In [None]:
df.head()

In [None]:
df.index

In [None]:
df.columns

In [None]:
# splitting each row (one repository) of the df into its own single-row dataframe
dfs = []
for i in tqdm(range(len(df)), 'Separating data by repository'):
    dfs.append(pd.DataFrame(df.iloc[i]).T)

In [None]:
# deleting the df variable to save memory
del df

In [None]:
# dropping the columns (PR slots) that hold no data in each dataframe
for i in tqdm(range(len(dfs)), 'Removing empty cells'):
    dfs[i] = dfs[i].dropna(axis='columns', how='all')

In [None]:
# ordering the dataframes by the number of PRs of each repository in descending order
dfs = sorted(dfs, key=lambda x: len(x.columns), reverse=True)

In [None]:
dfs[-1]

In [None]:
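# first PR (column 0) of the largest repository; .iloc avoids positional lookup on a string-labelled index
dfs[0][0].iloc[0]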

In [None]:
for i in tqdm(range(len(dfs)), 'Replacing every / in the index with @'):
    dfs[i].index = dfs[i].index.str.replace('/', '@')
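
The replacement is needed because the index is later used as a file name, and `/` is a path separator. The sketch below shows one way to build such a file name and to recover the repository name from it. Both helper names are hypothetical, and the round trip relies on the assumption that GitHub owner and repository names do not contain `@` (they can contain hyphens, which is why only the first `-` is treated as the rank separator).

```python
def repo_to_filename(rank, full_name):
    """Build '<rank>-<owner>@<repo>.json' from an 'owner/repo' name."""
    return f"{rank}-{full_name.replace('/', '@')}.json"


def filename_to_repo(filename):
    """Inverse of repo_to_filename: return (rank, 'owner/repo')."""
    stem = filename[: -len(".json")] if filename.endswith(".json") else filename
    # the rank is digits only, so the first '-' always ends it
    rank_text, _, owner_repo = stem.partition("-")
    return int(rank_text), owner_repo.replace("@", "/", 1)


name = repo_to_filename(3, "some-org/some-repo")
rank, repo = filename_to_repo(name)
```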

In [None]:
# create the repo-split folder if it does not exist yet
import os

out_dir = 'dataset/mined-comments-25stars-25prs-JavaScript.json/repo-split'
os.makedirs(out_dir, exist_ok=True)

# saving each df to '<rank>-<repository>.json', where rank 1 is the repository with the most columns
for i in tqdm(range(len(dfs)), 'Saving dataframes to json files'):
    dfs[i].to_json(os.path.join(out_dir, f'{i + 1}-{dfs[i].index[0]}.json'))
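
One thing to keep in mind when reading these files back: a plain alphabetical sort puts `10-...` before `2-...`, so the numeric prefix has to be parsed to recover the intended order. The sketch below does that and then hands the files to a thread pool, which is the kind of parallel processing the split is meant to enable. The file names and contents are invented, and `count_records` is a hypothetical placeholder for a real per-file job such as embedding and uploading.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rank_of(path):
    """Numeric rank prefix of a '<rank>-<owner>@<repo>.json' file name."""
    return int(path.name.split("-", 1)[0])


def count_records(path):
    """Placeholder per-file job: count the top-level keys (PR columns) in one file."""
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return path.name, len(data)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # invented files standing in for the repo-split folder
    samples = [
        ("10-a@x.json", {"0": {}}),
        ("2-b@y.json", {"0": {}, "1": {}}),
        ("1-c@z.json", {"0": {}, "1": {}, "2": {}}),
    ]
    for file_name, payload in samples:
        (folder / file_name).write_text(json.dumps(payload), encoding="utf-8")

    # sort by numeric rank rather than alphabetically
    files = sorted(folder.glob("*.json"), key=rank_of)
    ordered_names = [p.name for p in files]

    # each file is independent, so the work can be spread across workers
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(count_records, files))
```

`pool.map` returns results in the same order as its input, so `results` stays aligned with the rank order of `files`.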

In [None]:
# deleting the dfs variable to save memory
del dfs