Issues + PRs Extraction using GraphQL

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import json
import os
import csv
import re
import glob
from queries import *
from tokens_utils import *


In [None]:
def read_sample_csv(path):
    with open(path, "r") as f:
        df = pd.read_csv(path)
        return df


path = "./repositories.csv"
repos = read_sample_csv(path)

for idx, repo in repos.iterrows():
    print(f"Fetching closed issues for {repo['owner']}/{repo['repository']} ({repo['language']})")
    # change the entity between: issues, prs, and commit
    get_repo_info(repo['owner'], repo['repository'], 'issues')
    # the results are automatically stored as pickle files in ./issues/ | ./prs/ | ./commit/ per year
    # The complete extracted data can be found in RQ1/issues/, RQ2/prs/, and RQ3/commit in data folder.


Fetching closed issues for huggingface/transformers (Python)
Year 2020: Retrieved 0 issues.
Year 2021: Retrieved 0 issues.
Year 2022: Retrieved 0 issues.
Year 2023: Retrieved 7 issues.
Year 2024: Retrieved 0 issues.
Year 2025: Retrieved 0 issues.


Issues Preparation

In [None]:
# Auxiliary methods to prepare issues for bug classification
issues_pkl = glob.glob(os.path.join('./issues_content', '*.pkl'))
issues_df = pd.concat([pd.read_pickle(file) for file in issues_pkl], ignore_index=True)
# Because of the large information contained in issues, we separately extracted the dates of closing and creation of issues
# All data on dates can be found in RQ1/issues_dates
# We provide an already filtered set: timebox_sample.pkl
closed_df = pd.read_pickle("./timebox_sample.pkl")

closed_df['repo'] = closed_df['url'].str.split('/').str[4]
issues_df = issues_df[issues_df['url'].isin(closed_df['url'])]
issues_df['repo'] = issues_df['url'].str.split('/').str[4]


# We need to format comments and reduce tokens that exceed the limit of the model
issues_df['comments'] = issues_df['comments'].apply(normalize_comments)
issues_df["text"] = issues_df["title"].str.strip() + "\n" + issues_df["bodyText"].str.strip()
issues_df["text"] = issues_df["text"].str.strip()

for idx, row in issues_df.iterrows():
    # Reduce for gpt-5-mini model
    text, comments = reduce_tokens_with_comments(row['text'], row['comments'], 127000)
    issues_df.at[idx, 'comments'] = comments 

groups = issues_df.groupby('repo')
for group in groups:
    pd.to_pickle(group[1], f"issues_per_project/{group[0]}.pkl")

Reducing tokens from 134120 to 127000
Reducing tokens from 299197 to 127000


Once the data is processed, go to: issues_classifier.ipynb

PR Filtering

In [21]:
prs_pkl = glob.glob(os.path.join('./prs', '*.pkl'))
prs_df = pd.concat([pd.read_pickle(file) for file in prs_pkl], ignore_index=True)
# Filter by date
prs_df = prs_df[prs_df['mergedAt'] >= '2020-11-11']
prs_df = prs_df[prs_df['mergedAt'] <= '2025-11-11']
prs_df
# Filter by fix #1234 pattern
pattern = r'(?i)\b(fix|fixes|close|closes|resolve|resolves)\b\s*(?:issue\s*)?((?:#\d+\s*,?\s*)+)'
filtered_pr = prs_df[prs_df['bodyText'].str.contains(pattern, flags=re.IGNORECASE, regex=True, na=False)]
# Extract diff
filtered_pr[['owner', 'repo']] = filtered_pr['url'].str.extract(r'github\.com/([^/]+)/([^/]+)')
filtered_pr['diff'] = None

for idx, row in filtered_pr.iterrows():
    filtered_pr.loc[idx, 'diff'] = queries.get_diff(row['url'])
filtered_pr.to_pickle('./prs_with_diff.pkl')

  filtered_pr = prs_df[prs_df['bodyText'].str.contains(pattern, flags=re.IGNORECASE, regex=True, na=False)]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_pr[['owner', 'repo']] = filtered_pr['url'].str.extract(r'github\.com/([^/]+)/([^/]+)')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_pr[['owner', 'repo']] = filtered_pr['url'].str.extract(r'github\.com/([^/]+)/([^/]+)')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://

Unnamed: 0,number,title,url,bodyText,createdAt,mergedAt,owner,repo,diff
35,14490,chore(deps-dev): bump mocha from 11.0.1 to 11.1.0,https://github.com/nestjs/nest/pull/14490,Bumps mocha from 11.0.1 to 11.1.0.\n\nRelease ...,2025-01-23T00:27:08Z,2025-01-23T07:28:21Z,nestjs,nest,
123,14177,release: version 11.0.0,https://github.com/nestjs/nest/pull/14177,PR Checklist\nPlease check if your PR fulfills...,2024-11-20T14:52:32Z,2025-01-07T12:51:27Z,nestjs,nest,
125,13247,fix: typings for fastify `enableCors` method,https://github.com/nestjs/nest/pull/13247,Fastify adapter uses typings from @fastify/cor...,2024-02-22T06:44:27Z,2025-01-09T10:55:06Z,nestjs,nest,
292,14794,chore(deps-dev): bump axios from 1.7.9 to 1.8.3,https://github.com/nestjs/nest/pull/14794,Bumps axios from 1.7.9 to 1.8.3.\n\nRelease no...,2025-03-18T09:43:07Z,2025-03-18T10:11:12Z,nestjs,nest,
293,14792,fix(core): dependencies not resolving for requ...,https://github.com/nestjs/nest/pull/14792,PR Checklist\nPlease check if your PR fulfills...,2025-03-17T22:00:20Z,2025-03-19T08:46:20Z,nestjs,nest,
...,...,...,...,...,...,...,...,...,...
125977,10387,Deprecation warning on unused files,https://github.com/Comfy-Org/ComfyUI/pull/10387,Followup on #10366 to limit warning for files ...,2025-10-18T02:45:45Z,2025-10-19T20:05:46Z,Comfy-Org,ComfyUI,
125980,10376,Do batch_slice in EasyCache's apply_cache_diff,https://github.com/Comfy-Org/ComfyUI/pull/10376,Fixes #10344,2025-10-17T03:18:44Z,2025-10-17T04:39:37Z,Comfy-Org,ComfyUI,
125991,10351,Add TemporalScoreRescaling node,https://github.com/Comfy-Org/ComfyUI/pull/10351,Resolve #10214.\nTSR’s mechanism can be interp...,2025-10-15T07:42:53Z,2025-10-15T22:12:25Z,Comfy-Org,ComfyUI,
126002,10316,"update extra models paths example ""clip"" -> ""t...",https://github.com/Comfy-Org/ComfyUI/pull/10316,It is a bit confusing now that all docs and te...,2025-10-12T21:07:28Z,2025-10-13T03:35:33Z,Comfy-Org,ComfyUI,


With the data already filtered, go to embeddings_extractor.ipynb
The extracted prs with diffs can be found in RQ2/prs_with_diff.pkl