# Setup & Imports

In [33]:
!pip install datasets



In [34]:
from datasets import load_dataset
import pandas as pd
from collections import Counter

# NLP
import nltk
from nltk import ngrams, FreqDist
from nltk.tokenize import word_tokenize
from typing import List, Tuple

In [35]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Exploratory Data Analysis

Since the task is to benchmark `Refact-1.6B` using `HumanEvalPack` for Python, I will limit myself to the Python portion of the dataset for the EDA aswell.

In [36]:
dataset = load_dataset("bigcode/commitpackft", "python")

In [37]:
dataset

DatasetDict({
    train: Dataset({
        features: ['commit', 'old_file', 'new_file', 'old_contents', 'new_contents', 'subject', 'message', 'lang', 'license', 'repos'],
        num_rows: 56025
    })
})

In [38]:
df = pd.DataFrame(dataset['train'])

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56025 entries, 0 to 56024
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   commit        56025 non-null  object
 1   old_file      56025 non-null  object
 2   new_file      56025 non-null  object
 3   old_contents  56025 non-null  object
 4   new_contents  56025 non-null  object
 5   subject       56025 non-null  object
 6   message       56025 non-null  object
 7   lang          56025 non-null  object
 8   license       56025 non-null  object
 9   repos         56025 non-null  object
dtypes: object(10)
memory usage: 4.3+ MB


In [40]:
df.head()

Unnamed: 0,commit,old_file,new_file,old_contents,new_contents,subject,message,lang,license,repos
0,e905334869af72025592de586b81650cb3468b8a,sentry/queue/client.py,sentry/queue/client.py,"""""""\nsentry.queue.client\n~~~~~~~~~~~~~~~~~~~\...","""""""\nsentry.queue.client\n~~~~~~~~~~~~~~~~~~~\...",Declare queues when broker is instantiated,Declare queues when broker is instantiated\n,Python,bsd-3-clause,"imankulov/sentry,BuildingLink/sentry,zenefits/..."
1,45fc612fdc5a354dbf0bacccd345b1aebcc73e59,tests/test_openweather.py,tests/test_openweather.py,# -*- coding: utf-8 -*-\nimport bot_mock\nfrom...,# -*- coding: utf-8 -*-\nimport bot_mock\nfrom...,"Revert ""Fix openweather unit tests""","Revert ""Fix openweather unit tests""\n\nThis re...",Python,bsd-3-clause,"rnyberg/pyfibot,EArmour/pyfibot,aapa/pyfibot,a..."
2,22faee82e1f070532c0dfe5777136e842233a1f0,src/dashboard/src/main/templatetags/percentage.py,src/dashboard/src/main/templatetags/percentage.py,"from django.template import Node, Library\n\nr...","from django.template import Node, Library\n\nr...","Fix % only showing 0 or 100%, everything betwe...","Fix % only showing 0 or 100%, everything betwe...",Python,agpl-3.0,"artefactual/archivematica-history,artefactual/..."
3,950ac9130bafe1fced578bf61d746b047830bfa0,automata/base/exceptions.py,automata/base/exceptions.py,"#!/usr/bin/env python3\n""""""Exception classes s...","#!/usr/bin/env python3\n""""""Exception classes s...","Remove ""validation"" from RejectionException do...","Remove ""validation"" from RejectionException do...",Python,mit,caleb531/automata
4,462ae981ed5b9cc9a8f46e97dfe7908c0827ea64,account_invoice_line_description/res_config.py,account_invoice_line_description/res_config.py,# -*- coding: utf-8 -*-\n#####################...,# -*- coding: utf-8 -*-\n#####################...,"Fix implied_group, it still refers to the old ...","Fix implied_group, it still refers to the old ...",Python,agpl-3.0,"Antiun/account-invoicing,hbrunn/account-invoic..."


## Python file types

Since I limited myself to Python, `lang` should only contain `Python`. Furthermore, I expect most of the filenames to end with `.py`. If they do not, they should be some sort of Python related configuration file.

In [41]:
assert(len(df['lang'].duplicated(keep=False)) == len(df))

In [42]:
def verify_file_types_in(columns: List[str], df: pd.DataFrame) -> pd.DataFrame:
  for col in columns:
    python_files_in_col = len(df['old_file'].str.endswith('.py'))
    non_python_files_in_col = len(df) - python_files_in_col
    print(f'Column `{col}` contains {python_files_in_col} filenames ending on ".py" and {non_python_files_in_col} filenames that do not end on ".py".')

verify_file_types_in(['old_file', 'new_file'], df)

Column `old_file` contains 56025 filenames ending on ".py" and 0 filenames that do not end on ".py".
Column `new_file` contains 56025 filenames ending on ".py" and 0 filenames that do not end on ".py".


Due to the fact that there are no files not ending on ".py" I conclude that the dataset does not include any auxiliary configuration files. This is a good sign for the use case of predicting method names during refactoring as configuration files would no be able to contribute directly.

## Difference between `subject` and `message` columns

Next, I will investigate if and how the `subject` and `message` columns differ.

In [43]:
df['subject_message_diff'] = [message.replace(subject, '') for subject, message in zip(df.subject, df.message)]
df['subject_message_diff']

0                                                       \n
1        \n\nThis reverts commit 36e100e649f0a337228a6d...
2           \n\n\nAutoconverted from SVN (revision:1548)\n
3                                                       \n
4                                                       \n
                               ...                        
56020                                                   \n
56021                                                   \n
56022                                                   \n
56023                                                   \n
56024                                                   \n
Name: subject_message_diff, Length: 56025, dtype: object

The key difference seems to be the explicit inclusion of line break delimiters. Let's verify this hunch.

In [44]:
df['subject_message_diff'].str.contains(r"\n").value_counts()

True     53951
False     2074
Name: subject_message_diff, dtype: int64

In most cases the only difference is the explicit inclusion of the line break delimiter. However, there are some cases where the difference is more pronounced. Let's take a look at these cases.

In [45]:
non_linebreak_subject_message_diff_df = df[~df['subject_message_diff'].str.contains(r"\n")].loc[:, ['subject', 'message', 'subject_message_diff']]
non_linebreak_subject_message_diff_df

Unnamed: 0,subject,message,subject_message_diff
9,Change the version of the package.,Change the version of the package.,
32,Update for compatibility with python 3,Update for compatibility with python 3,
47,Deal with MD and RST doc,[c] Deal with MD and RST doc,[c]
62,Add standard Ansible exception handling,Add standard Ansible exception handling,
77,Fix string formatting for NotRegistered exception,Fix string formatting for NotRegistered exception,
...,...,...,...
55909,Add configparser import to avoid windows packa...,Add configparser import to avoid windows packa...,
55943,Add a few subreddits to @r_wholesome,Add a few subreddits to @r_wholesome,
55957,Make dsamp a visible component of blimpy,Make dsamp a visible component of blimpy,
55979,Update key map to add 192,Update key map to add 192,


Most of the subject message differences **not** containing `\n` seem to be empty.

In [46]:
non_empty_subject_message_diffs = len(non_linebreak_subject_message_diff_df[non_linebreak_subject_message_diff_df['subject_message_diff'] != ''])
print(f'Of the entries not containing `\\n` {len(non_linebreak_subject_message_diff_df) - non_empty_subject_message_diffs} entries or {(1-(non_empty_subject_message_diffs/len(non_linebreak_subject_message_diff_df)))*100} percent are empty.')

Of the entries not containing `\n` 1986 entries or 95.75699132111862 percent are empty.


I conclude the the difference and thus additional information encoded in the `message` column is minor.

## N-grams of commit subjects (`message` column)

Because of the above result, I will perform this part of the analysis only on the `message` column. While the data in this column does contain some noise in  the form of explicit line break delimiters `\n`, some entries actually contain added information.

In [47]:
def __remove_line_break_escape_sequence(messages: pd.Series) -> pd.Series:
  return messages.str.replace('\\n', '', regex=True)

def __reduce_to_alphanumeric_and_whitespace(messages: pd.Series) -> pd.Series:
  return messages.str.replace(pat='[^a-zA-Z0-9\\s]', repl='', regex=True)

def clean_messages_in(messages: pd.Series) -> pd.Series:
  messages = __reduce_to_alphanumeric_and_whitespace(messages)
  messages = __remove_line_break_escape_sequence(messages)
  messages = messages.str.lower()

  return messages

In [48]:
messages = clean_messages_in(df['message'])

In [49]:
tokenized_messages = messages.apply(word_tokenize)

Now, I extract all n-grams for n={1,2,3,4,5} to investigate any sort of patterns present in the types of commits in the dataset. For this I will compute a probability distribution based on the counts of each n-gram I find.

In [50]:
def extract_ngrams(messages: pd.Series, n: int) -> List[Tuple[Tuple[str], float]]:
  """
  Extracts the n-grams specified by the n parameter from messages and computes their
  probabilitiy distribution based on the observed counts.

  Args:
    messages (pd.Series): A series of the tokenized messages, each row corresponds to one message and is a list of word tokens.
    n (int): Specifices the type of n-grams to extract.

  Returns:
    (List[Tuple[Tuple[str], float]]): A list of tuples where fst is a tuple of strings containing the n-gram and snd the probability
      for the occurrence of this n-gram.
  """
  # Extract the n_grams and flatten the resulting list
  extracted_ngrams = [ngram for tokenized_message in messages for ngram in list(ngrams(tokenized_message, n))]

  # Get the frequency distribution of the n_grams by counting
  freq_dist = FreqDist(extracted_ngrams)

  # Map frequency distribution to probabilities
  total_ngrams = len(extracted_ngrams)
  ngrams_with_probabilities = {gram: round(freq / total_ngrams, 6) for gram, freq in freq_dist.items()}

  # Sort n-grams by probabilities
  sorted_ngrams = sorted(ngrams_with_probabilities.items(), key=lambda x: x[1], reverse=True)

  return sorted_ngrams


n_grams = {}
for n in range(1, 6):
  n_grams[n] = extract_ngrams(tokenized_messages, n)

Let's take a look at the top 20 n-grams for each type of n-gram (ie, unigram, bigram, ...).

In [52]:
top_ngrams_cutoff = 20
for n in range(1,6):
  print(f"{top_ngrams_cutoff} most probable {n}-grams:")
  for m, gram in enumerate(n_grams[n][:top_ngrams_cutoff]):
    print(f"{m+1}: {gram}")
  print("\n\n")

20 most probable 1-grams:
1: (('to',), 0.042628)
2: (('add',), 0.041358)
3: (('for',), 0.03146)
4: (('the',), 0.028937)
5: (('a',), 0.016898)
6: (('in',), 0.015937)
7: (('of',), 0.013998)
8: (('fix',), 0.013983)
9: (('test',), 0.013618)
10: (('and',), 0.01225)
11: (('use',), 0.008624)
12: (('from',), 0.007086)
13: (('is',), 0.006732)
14: (('with',), 0.006596)
15: (('tests',), 0.006528)
16: (('on',), 0.006412)
17: (('remove',), 0.006081)
18: (('update',), 0.005789)
19: (('script',), 0.005623)
20: (('make',), 0.005207)



20 most probable 2-grams:
1: (('add', 'a'), 0.005199)
2: (('test', 'for'), 0.004151)
3: (('tests', 'for'), 0.003221)
4: (('script', 'to'), 0.002811)
5: (('add', 'test'), 0.002648)
6: (('in', 'the'), 0.002517)
7: (('to', 'the'), 0.002485)
8: (('for', 'the'), 0.002287)
9: (('instead', 'of'), 0.002113)
10: (('add', 'script'), 0.002015)
11: (('of', 'the'), 0.001984)
12: (('add', 'tests'), 0.001779)
13: (('to', 'be'), 0.001735)
14: (('to', 'use'), 0.001334)
15: (('script', '

# Analysis (Commits as a source of natural language instructions for code editing)

A common thread throughout these n-grams is that most of the commits in the dataset seem to concern tests. This could present a limitation for the use case presented as it means that models trained on the dataset would be biased towards refactoring testing suites. Furthermore, it remains to be seen whether these commits actually refactor code or just commit entire test methods as a whole.

Other common n-grams that are present as n increases are hinting at commits for code documentation and migration of code to different versions of used libraries. The latter might be especially helpful, as these tasks can be time-consuming, tediuous and difficult.

All in all the dataset is indeed of a good quality. It is however biased towards a certain set of tasks, especially considering that the dataset only contains ~56k entries to begin with, which is not a lot in the context of modern deep learning and the trajectory that it is taking (ie. more = better).

In summary, while I think that commits as natural language instructions for code editing is an interesting avenue to explore there are some key issues that I have identified by looking a this dataset:
- Biased towards test generation
- Might commit entire methods in some cases making it useless for code editing (this would be more in the avenue of code generation)
- Commits might not be detailed enough to really describe what the author changed in such a way that it is useful for a model
  - e.g., "Added some tests" or "Migrated to new version" require a lot more context to yield semantically meaningful results