# Supporting code and data for "..."

In [1]:
%matplotlib inline

import os
import sys
print(f'Python {sys.version}')

import IPython
from IPython.display import display, HTML
print(f'IPython {IPython.__version__}')

print('\nLibraries:\n')

import csv
print(f'csv {csv.__version__}')

import matplotlib
import matplotlib.pyplot as plt
print(f'matplotlib {matplotlib.__version__}')

import numpy as np
print(f'numpy {np.__version__}')

import pandas as pd
from pandas.plotting import register_matplotlib_converters
print(f'pandas {pd.__version__}')

import re
print(f're {re.__version__}')

import requests
print(f'requests {requests.__version__}')

#import scipy
#import scipy.stats
#print(f'scipy {scipy.__version__}')


#import statsmodels
#import statsmodels.formula.api as smf
#from statsmodels.stats.outliers_influence import summary_table
#print(f'statsmodels {statsmodels.__version__}')

Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
[GCC 10.3.0]
IPython 7.24.1

Libraries:

csv 1.0
matplotlib 3.4.2
numpy 1.20.3
pandas 1.2.4
re 2.2.1
requests 2.25.1


## Data collection

We use the GitHub GraphQL API because it lets us fetch only the fields we need, and at a much faster rate: a single request can return up to 100 nodes. Getting all the objects of a given type then requires repeating the request to go through every page of results.
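
As a minimal sketch of this paging loop (not the notebook's own helper, which is defined in the cells below), the function here keeps requesting pages until `hasNextPage` is false. `fetch_page` is a hypothetical callable standing in for the HTTP request; it is backed by an in-memory list so that the example runs offline, and the cursors are made-up list offsets rather than real GitHub cursors.

```python
def paginate(fetch_page):
    """Collect the nodes of every page of a cursor-paginated connection.

    `fetch_page(cursor)` is a hypothetical callable returning a dict shaped
    like a GraphQL connection: {'nodes': [...], 'pageInfo': {...}}.
    """
    nodes, cursor = [], None
    while True:
        page = fetch_page(cursor)
        nodes.extend(page['nodes'])
        if not page['pageInfo']['hasNextPage']:
            return nodes
        # The cursor of the last item tells the server where the next page starts.
        cursor = page['pageInfo']['endCursor']


def make_fake_fetch(items, page_size):
    """Offline stand-in for the API: cursors are list offsets encoded as strings."""
    def fetch_page(cursor):
        start = 0 if cursor is None else int(cursor)
        end = start + page_size
        return {
            'nodes': items[start:end],
            'pageInfo': {'hasNextPage': end < len(items), 'endCursor': str(end)},
        }
    return fetch_page


all_nodes = paginate(make_fake_fetch(list(range(250)), page_size=100))
print(len(all_nodes))
```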

You need to provide a personal `api_token` if you want to get fresh data from GitHub. Otherwise, this notebook will skip the data collection step and load the CSV files from the local filesystem.

In [2]:
api_token = ''

In [3]:
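def requestAllPages(query,rows_and_next_variables,filename,columns):
  # Skip data collection entirely when no token is provided.
  if api_token == '':
    return
  headers = {'Authorization': f'token {api_token}'}
  url = 'https://api.github.com/graphql'
  # Calling the callback with None yields the initial rows and the variables of the first request.
  rows, variables = rows_and_next_variables(None)
  # `variables` is a worklist: each entry holds the variables of one pending request.
  while len(variables)>0:
    payload = {'query':query,'variables':variables.pop()}
    r = requests.post(url=url, json=payload, headers=headers)
    if r.status_code in (401, 403):
      print('Unauthorized or forbidden request:')
      print(payload)
    r.raise_for_status() # Abort if unsuccessful request
    new_rows, next_variables = rows_and_next_variables(r.json()['data'])
    rows += new_rows
    variables += next_variables
  if len(rows) > 0:
    # newline='' is what the csv module documentation recommends for files passed to csv.writer.
    with open(filename, 'w', newline='') as f:
      writer = csv.writer(f)
      writer.writerow(columns)
      writer.writerows(rows)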
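
Note that `raise_for_status()` only catches HTTP-level failures. A GraphQL server can also answer with status 200 and report problems in a top-level `errors` list, in which case `data` may be missing or null. The following is a minimal sketch of a guard for that case, assuming the standard GraphQL response shape; `extract_data` is a hypothetical helper that is not part of this notebook, and the error message is made up.

```python
def extract_data(payload):
    """Return payload['data'], raising if the response reports GraphQL errors."""
    errors = payload.get('errors')
    if errors:
        messages = '; '.join(e.get('message', 'unknown error') for e in errors)
        raise RuntimeError(f'GraphQL errors: {messages}')
    return payload['data']


# A well-formed response passes through unchanged.
ok = extract_data({'data': {'search': {'nodes': []}}})
print(ok)

# A response carrying errors is turned into an exception.
try:
    extract_data({'data': None, 'errors': [{'message': 'Field does not exist'}]})
except RuntimeError as exc:
    print(exc)
```

Such a helper could wrap `r.json()` before the result is handed to the callback, so that a failed query stops the loop instead of surfacing later as a `TypeError`.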

We look for all the pull requests in which the bot proposed to minimize CI failures, by searching for the words "coqbot ci minimize".
This query is redundant with the next one, but it is useful if one only wants the list of PRs.
To run it, uncomment the last line and make sure to provide an `api_token`.

In [4]:
def fetch_prs():

  query = """
    query getPullRequestList($cursor: String) {
      search(query: "repo:coq/coq coqbot ci minimize", type:ISSUE, first: 100, after: $cursor) {
        nodes {
          ... on PullRequest {
            number
          }
        }
        pageInfo {
          hasNextPage
          endCursor
        }
      }
    }
  """

  def rows_and_next_variables(data):
    if data is None:
      return [], [{}]
    else:
      rows = []
      pullRequests = data['search']
      for node in pullRequests['nodes']:
        if 'number' in node:
          rows.append([node['number']])
      if pullRequests['pageInfo']['hasNextPage']:
        return rows, [{'cursor':pullRequests['pageInfo']['endCursor']}]
      else:
        return rows, []

  requestAllPages(
      query,
      rows_and_next_variables,
      'pullrequests.csv',
      ['number']
  )

# fetch_prs()

Here, we search again for all PRs where CI minimization was proposed, but this time we retrieve all their comments to find out what happened. To re-run this, uncomment the last line and make sure to provide an `api_token`.

In [5]:
def fetch_pr_comments():

  query = """
    query commentQuery($number: Int!, $single: Boolean!, $prCursor: String, $commentCursor: String) {
      search(query: "repo:coq/coq coqbot ci minimize", type:ISSUE, first: 10, after: $prCursor) @skip (if: $single) {
        pageInfo {
          endCursor
          hasNextPage
        }
        nodes {
          ... pullRequest
        }
      }
      repository(owner: "coq", name: "coq") @include (if: $single) {
        pullRequest(number: $number) {
          ... pullRequest
        }
      }
    }

    fragment pullRequest on PullRequest {
      number
      author { login }
      comments(first: 50, after: $commentCursor) {
        pageInfo {
          endCursor
          hasNextPage
        }
        nodes {
          createdAt
          author { login }
          bodyText
        }
      }
    }
  """

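  def treat_pr(pr):
    # Turn one PR node into CSV rows, plus a follow-up request if more comments remain.
    rows, variables = [], []
    number = pr['number']
    # `author` is null when the account was deleted, hence the fallback to None.
    pr_author = (pr['author'] or {}).get('login')
    for comment in pr['comments']['nodes']:
      date = pd.to_datetime(comment['createdAt']).tz_localize(None)
      # Keep at most 500 characters and escape newlines so each comment stays on a single line.
      body = comment['bodyText'][:500].replace('\n','\\n')
      comment_author = (comment['author'] or {}).get('login')
      rows.append([number,pr_author,date,comment_author,body])
    if pr['comments']['pageInfo']['hasNextPage']:
      # Queue a single-PR request to fetch the next page of comments.
      variables += [{
          'single':True,
          'number':number,
          'commentCursor':pr['comments']['pageInfo']['endCursor']
      }]
    return rows, variables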

  def rows_and_next_variables(data):
    if data is None:
      return [], [{'single':False,'number':0}]
    else:
      if 'search' in data:
        prs = data['search']
        rows, variables = [], []
        for pr in prs['nodes']:
          if 'number' in pr:
            new_rows, new_variables = treat_pr(pr)
            rows += new_rows
            variables += new_variables
        if prs['pageInfo']['hasNextPage']:
          variables += [{
              'single':False,
              'number':0,
              'prCursor':prs['pageInfo']['endCursor']
          }]
        return rows, variables
      else:
        return treat_pr(data['repository']['pullRequest'])

  requestAllPages(
      query,
      rows_and_next_variables,
      'pr_comments.csv',
      ['number','pr_author','date','author','body']
  )

# fetch_pr_comments()
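
The query in this cell runs in two modes, selected by the `$single` variable through the `@skip` and `@include` directives: a search over PRs, or a follow-up on one PR whose comments span more than one page. The sketch below spells out the variable dictionaries for each mode. The helper names `search_variables` and `single_pr_variables` are hypothetical, the cursor strings are made up, and `number: 0` mirrors the placeholder the notebook passes to satisfy the non-null `$number` declaration when it is unused.

```python
def search_variables(pr_cursor=None):
    """Variables for search mode: the `repository` field is not included."""
    variables = {'single': False, 'number': 0}  # number is required but unused here
    if pr_cursor is not None:
        variables['prCursor'] = pr_cursor
    return variables


def single_pr_variables(number, comment_cursor):
    """Variables for single-PR mode: the `search` field is skipped."""
    return {'single': True, 'number': number, 'commentCursor': comment_cursor}


print(search_variables())                  # first page of the search
print(search_variables('abc'))             # next page of the search (made-up cursor)
print(single_pr_variables(12493, 'xyz'))   # next page of comments for one PR (made-up cursor)
```

Keeping both modes in one query lets a single worklist in `requestAllPages` mix the two kinds of pending requests.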

## Data processing

In [6]:
def load_csv(filename):

  df = pd.read_csv(filename,parse_dates=True,index_col=2)
  print(f'File retrieved from local file system: {filename}')
  return df
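
To illustrate what `index_col=2` and `parse_dates=True` do together, here is a small sketch on made-up data with the same column order as `pr_comments.csv`: the third column (`date`) becomes the index, and pandas parses it as dates. The sample rows and author names are invented for the example.

```python
import io
import pandas as pd

# Made-up sample in the same column order as pr_comments.csv.
sample = io.StringIO(
    'number,pr_author,date,author,body\n'
    '1,alice,2021-06-19 19:57:50,bob,@coqbot ci minimize\n'
    '1,alice,2021-06-20 08:00:00,coqbot-app,Minimized File example\n'
)

df = pd.read_csv(sample, parse_dates=True, index_col=2)
print(df.index.name)        # name of the column that became the index
print(type(df.index))       # index type after date parsing
print(list(df.columns))     # remaining data columns
```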

In [7]:
pr_comments = load_csv('pr_comments.csv')

File retrieved from local file system: pr_comments.csv


Pre-processing: we exclude PRs opened by Jason Gross (mostly for testing the CI minimizer):

In [8]:
pr_comments = pr_comments[~ pr_comments['pr_author'].isin(['JasonGross'])]

Retrieve all comments that triggered the bug minimizer:

In [9]:
ci_minimize_comments = pr_comments[
    pr_comments['body'].str.contains('@coqbot:? [Cc][Ii][- ][Mm]inimize') &
    ~ pr_comments['author'].isin(['coqbot-app'])
  ].sort_values('number')
ci_minimize_comments

date,number,pr_author,author,body
2021-07-25 16:58:40,11966,olaure01,olaure01,@coqbot ci minimize
2021-06-19 19:57:50,12493,ppedrot,JasonGross,@coqbot ci minimize
2021-07-20 23:39:01,12493,ppedrot,JasonGross,Hopefully things work better this time @coqbot...
2021-06-21 01:37:25,12512,ppedrot,JasonGross,@coqbot ci minimize
2021-08-31 21:07:07,12512,ppedrot,ppedrot,@coqbot ci minimize
...,...,...,...,...
2021-10-05 17:13:30,14986,SkySkimmer,SkySkimmer,@coqbot ci minimize
2021-10-19 15:59:31,15048,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-category_theory
2021-10-19 15:59:15,15048,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-hott ci-relation_algebra
2021-10-19 14:16:41,15048,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-hott
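
`Series.str.contains` treats its argument as a regular expression by default and searches anywhere in the string, so a comment matches as soon as the command appears somewhere in its body. The sketch below applies the same pattern with the standard `re` module to a few made-up comment bodies, to show which spellings it accepts.

```python
import re

# Same pattern as in the cell above.
TRIGGER = re.compile(r'@coqbot:? [Cc][Ii][- ][Mm]inimize')

samples = [
    '@coqbot ci minimize',
    '@coqbot: CI-minimize ci-hott',
    'Let us try again: @coqbot ci minimize',
    '@coqbot minimize',       # the "ci" word is missing
    '@coqbot  ci minimize',   # two spaces after the mention
]
for s in samples:
    print(bool(TRIGGER.search(s)), repr(s))
```

The first three bodies match and the last two do not, since the pattern requires exactly one space after the mention and the word "ci" before "minimize".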


In [10]:
ci_minimize_triggerers = ci_minimize_comments.drop_duplicates(subset=['number', 'author']).sort_values('pr_author')
ci_minimize_triggerers

date,number,pr_author,author,body
2021-08-03 16:39:13,14740,Alizter,Zimmi48,This wasn't expected that this would break any...
2021-08-13 12:20:58,14777,Alizter,Alizter,@coqbot ci minimize
2021-10-19 15:59:31,15048,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-category_theory
2021-10-05 17:13:30,14986,SkySkimmer,SkySkimmer,@coqbot ci minimize
2021-06-29 19:42:28,13107,SkySkimmer,JasonGross,"Minimization of ci-perennial should work now, ..."
2021-08-11 21:54:13,14758,SkySkimmer,Alizter,@coqbot ci minimize
2021-08-14 11:39:06,14785,SkySkimmer,SkySkimmer,@coqbot ci minimize
2021-10-09 01:37:39,14986,SkySkimmer,JasonGross,"I've hopefully fixed the import problem, let's..."
2021-08-14 07:59:42,14783,SkySkimmer,SkySkimmer,@coqbot ci minimize\n(error at https://github....
2021-06-10 17:23:25,14480,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-iris
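
`drop_duplicates` keeps the first row of each `(number, author)` pair in the current row order. Since `ci_minimize_comments` is sorted by `number` rather than by date, the retained comment is not necessarily the earliest one (for PR 15048 above, the 15:59:31 comment is kept although a 14:16:41 one exists). A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up comments: alice triggers twice on PR 1, bob once on PR 1, alice once on PR 2.
comments = pd.DataFrame({
    'number': [1, 1, 1, 2],
    'author': ['alice', 'alice', 'bob', 'alice'],
    'body': ['first', 'second', 'third', 'fourth'],
})

# Only the first row of each (number, author) pair survives.
triggerers = comments.drop_duplicates(subset=['number', 'author'])
print(triggerers['body'].tolist())
```

If the earliest trigger per pair is needed, sorting by date before calling `drop_duplicates` would be one way to get it.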


Let's focus only on cases where the CI minimization did produce a minimized file:

In [11]:
ci_minimize_results = pr_comments[pr_comments['body'].str.contains('Minimized File') & pr_comments['author'].isin(['coqbot-app'])].sort_values('number')
ci_minimize_results

date,number,pr_author,author,body
2021-07-25 17:04:47,11966,olaure01,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-07-25 18:26:22,11966,olaure01,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-07-25 22:15:29,11966,olaure01,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-07-26 03:32:24,11966,olaure01,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-07-26 05:49:12,11966,olaure01,coqbot-app,Minimized File /github/workspace/builds/coq/co...
...,...,...,...,...
2021-10-11 01:11:31,14986,SkySkimmer,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-10-07 16:46:40,14986,SkySkimmer,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-10-19 16:12:59,15048,SkySkimmer,coqbot-app,Minimized File /github/workspace/builds/coq/co...
2021-10-19 16:10:08,15048,SkySkimmer,coqbot-app,Minimized File /github/workspace/builds/coq/co...


In [12]:
minimized_prs = ci_minimize_results['number'].drop_duplicates()
len(minimized_prs)

32

We call the following "successful triggerers", but this is an over-approximation: a triggerer counts as successful as soon as the CI minimizer produced a minimized file at least once in the PR, even if that run was triggered by someone else when several people triggered it in the same PR:

In [13]:
successful_triggerers = ci_minimize_triggerers[ci_minimize_triggerers['number'].isin(minimized_prs)]
successful_triggerers

date,number,pr_author,author,body
2021-08-03 16:39:13,14740,Alizter,Zimmi48,This wasn't expected that this would break any...
2021-08-13 12:20:58,14777,Alizter,Alizter,@coqbot ci minimize
2021-10-19 15:59:31,15048,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-category_theory
2021-10-05 17:13:30,14986,SkySkimmer,SkySkimmer,@coqbot ci minimize
2021-06-29 19:42:28,13107,SkySkimmer,JasonGross,"Minimization of ci-perennial should work now, ..."
2021-08-11 21:54:13,14758,SkySkimmer,Alizter,@coqbot ci minimize
2021-08-14 11:39:06,14785,SkySkimmer,SkySkimmer,@coqbot ci minimize
2021-10-09 01:37:39,14986,SkySkimmer,JasonGross,"I've hopefully fixed the import problem, let's..."
2021-08-14 07:59:42,14783,SkySkimmer,SkySkimmer,@coqbot ci minimize\n(error at https://github....
2021-06-10 17:23:25,14480,SkySkimmer,SkySkimmer,@coqbot ci minimize ci-iris
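
The over-approximation can be seen on a small made-up example that applies the same `isin` filter. Suppose that on PR 1 a run triggered by `alice` produced nothing and a later run triggered by `bob` produced a minimized file: both end up counted as successful, because the filter only looks at the PR number. All names and numbers below are invented.

```python
import pandas as pd

# Made-up triggerers: alice and bob both triggered the minimizer on PR 1, carol on PR 2.
triggerers = pd.DataFrame({
    'number': [1, 1, 2],
    'author': ['alice', 'bob', 'carol'],
})
# Only PR 1 ever received a minimized file (in this scenario, from bob's run).
minimized_prs = pd.Series([1])

successful = triggerers[triggerers['number'].isin(minimized_prs)]
print(successful['author'].tolist())
```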
