<center><h1>Step 0 - Preprocessing</h1></center> 

In this section, we read the bug reports and source code files of all 12 projects and, for ease of access, save them as two pickle files in the ./Output directory. Running this code populates ./Output with "allBugReports.pickle", a pandas dataframe that contains the bug reports from all projects, and "allSourceCodes.pickle", a pandas dataframe that contains the raw (not yet cleaned) source files of all projects. Text cleaning is applied later, in the Data Pre-processing section.

### Required Libraries

In [1]:
!pip install javalang



In [3]:
from __future__ import division
import pandas as pd
import numpy as np
import os
from os import listdir
from os.path import isfile, join
import warnings
import javalang
import re
import glob
import math
import time
import xml.etree.ElementTree as ET
import requests
import multiprocessing
from tqdm.notebook import tqdm as tq
from time import gmtime, strftime
from random import randint
warnings.simplefilter(action='ignore', category=FutureWarning)

<center><h1>Splitting code and natural language</h1></center> 

<center><h1>Loading source codes into pandas Dataframe</h1></center> 

In [3]:
def classNames_methodNames(node):
    result=''
    if isinstance(node,javalang.tree.MethodDeclaration) or isinstance(node,javalang.tree.ClassDeclaration):
        return node.name.lower()+' '
    if not (isinstance(node,javalang.tree.PackageDeclaration) or
        isinstance(node,javalang.tree.FormalParameter) or
       isinstance(node,javalang.tree.Import)):
        if node:
            if isinstance(node, javalang.ast.Node):
                for childNode in node.children:
                    result+=classNames_methodNames(childNode)
    return result
    
def traverse_node(node,i=0):
    i+=1
    result=''
    if not(isinstance(node,javalang.tree.PackageDeclaration)
            or isinstance(node,javalang.tree.FormalParameter)            
            or isinstance(node,javalang.tree.Import)
            or isinstance(node,javalang.tree.CompilationUnit)):
        if node:
            if (isinstance(node,int) or isinstance(node,str) or isinstance(node,float)) and i==2:
                result+=node+' '
            if isinstance(node, javalang.ast.Node):
                for childNode in node.children:
                    result+=traverse_node(childNode,i)
    return result

def code_parser(code):
    try:
        tree = javalang.parse.parse(code)
        return ''.join([traverse_node(node) for path, node in tree]) + ' ' + ''.join([classNames_methodNames(node)
                                                                                      for path, node in tree])
    except Exception as e: 
        print(e)
        return ''

def loadSourceFiles2df(PATH,project):
    """
    Receives: path to the project's source directory and the project name
    Process: finds all the java files under PATH and loads their raw contents
             into a pandas dataframe (no preprocessing is applied at this stage)
    Returns: dataframe >> "filename","unprocessed_code","project"
    """
    print('Loading source files of {}  ...'.format(project))
    # Use the PATH that was passed in instead of overwriting it with a hard-coded folder
    all_source_files=glob.glob(PATH+'/**/*.java', recursive=True)
    sourceCodesList=[]

    for filename in tq(all_source_files):
        # Close each file promptly instead of leaving the handle open
        with open(filename,encoding='ISO-8859-1') as source_file:
            code=source_file.read()
        # Build a dotted, lower-case name relative to 'src/' (or to the project folder)
        if 'src/' in filename:
            relative_name=filename.split('src/')[1]
        else:
            relative_name=filename.split(project)[1]
        sourceCodesList.append({"filename":relative_name.replace('/','.').lower(),
                                "unprocessed_code":code,
                                'project':project})
    # DataFrame.append was removed in pandas 2.0, so build the frame directly
    return pd.DataFrame(sourceCodesList)

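def load_all_SCs(dataPath):
    print('\tLoading all source codes ... ')
    # Only directories are projects; skip stray files in the data folder
    all_projects= [folder for folder in listdir(dataPath)
                   if os.path.isdir(os.path.join(dataPath,folder))]
    project_frames=[]
    for project in all_projects:
        source_path=os.path.join(dataPath,project,"gitrepo")
        project_frames.append(loadSourceFiles2df(source_path,project))
    # pd.concat replaces the DataFrame.append calls removed in pandas 2.0
    return pd.concat(project_frames) if project_frames else pd.DataFrame([])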
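
The loader above turns each file path into a dotted, lower-case name, and the bug report loader further below applies the same `/` to `.` replacement to the fixed files. The sketch below isolates that normalization so it can be checked on its own. `normalize_source_path` and `matches_fix` are hypothetical helpers that are not part of the notebook, and the suffix comparison is an assumption: it presumes fixed-file entries are package-style names that may lack the leading folders kept in `filename`.

```python
def normalize_source_path(path, project):
    """Mirror the filename normalization used when loading source files."""
    # Assumption: paths use forward slashes, as the loader above expects.
    if 'src/' in path:
        relative = path.split('src/')[1]
    else:
        relative = path.split(project)[1]
    return relative.replace('/', '.').lower()


def matches_fix(filename, fixed_file):
    """Hypothetical check: does a normalized filename refer to a fixed file?

    A plain equality test can miss matches when the filename keeps extra
    leading folders, so we also accept a match on a dotted suffix.
    """
    return filename == fixed_file or filename.endswith('.' + fixed_file)


# Illustrative path; the folder layout is made up for this example.
example = normalize_source_path(
    "data/LANG/gitrepo/src/java/org/apache/commons/lang/builder/EqualsBuilder.java",
    "LANG",
)
print(example)
print(matches_fix(example, "org.apache.commons.lang.builder.equalsbuilder.java"))
```

Whether a suffix match is strict enough depends on the dataset, so it is worth spot-checking a few bug reports per project before relying on it for evaluation.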

<center><h1>Loading bug reports pandas Dataframe</h1></center> 

In [4]:
def loadBugs2df(PATH,project):
    """
    @Receives: the path to bug repository (the xml file) and the project name
    @Process: Parses the xml file and reads the summary, description and fix files per bug id. 
    @Returns: Returns the dataframe, indexed by bug id
    """
    print("Loading Bug reports ... ")
    bugRepo = ET.parse(PATH).getroot()
    buglist=[]                   
    for bug in tq(bugRepo.findall('bug')):
        bugDict=dict({"id":bug.attrib['id'],"fix":[],"fixdate":bug.attrib['fixdate']
                      ,"summary":None,"description":None,"project":project,"average_precision":0.0})
        for bugDetail in bug.find('buginformation'):
            if bugDetail.tag=='summary':
                bugDict["summary"]=bugDetail.text
            elif bugDetail.tag=='description':
                bugDict["description"]=bugDetail.text
        # Normalize fixed file paths the same way as the source filenames
        bugDict["fix"]=np.array([fixFile.text.replace('/','.').lower() for fixFile in bug.find('fixedFiles')])
        buglist.append(bugDict)
    # DataFrame.append was removed in pandas 2.0, so build the frame directly.
    # "text" stays an empty placeholder column; it is filled in during cleaning.
    columns=["id","fix","text","fixdate","summary","description","project","average_precision"]
    all_bugs_df=pd.DataFrame(buglist,columns=columns)
    return all_bugs_df.set_index('id')

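def load_all_BRs(dataPath):
    print('\tLoading all bug reports ... ')
    # Only directories are projects; skip stray files in the data folder
    all_projects= [folder for folder in listdir(dataPath)
                   if os.path.isdir(os.path.join(dataPath,folder))]
    project_frames=[]
    total=0
    for project in all_projects:
        data_path=os.path.join(dataPath,project,"bugrepo","repository.xml")
        project_frames.append(loadBugs2df(data_path,project))
        total+=len(project_frames[-1])
        print(total)  # running count of loaded bug reports
    # pd.concat replaces the DataFrame.append calls removed in pandas 2.0
    return pd.concat(project_frames) if project_frames else pd.DataFrame([])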
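
To make the expected input easier to picture, here is a small sketch of the XML shape that `loadBugs2df` reads. The sample document is hypothetical: only the names the loader actually accesses (`bug`, its `id` and `fixdate` attributes, `buginformation`, `summary`, `description`, `fixedFiles`) come from the code above, while the root tag and the `file` tag are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the root tag and the <file> tag are assumptions.
SAMPLE_XML = """
<bugrepository name="DEMO">
  <bug id="1" fixdate="2008-01-13 07:00:40">
    <buginformation>
      <summary>Example summary</summary>
      <description>Example description</description>
    </buginformation>
    <fixedFiles>
      <file>org/example/Foo.java</file>
    </fixedFiles>
  </bug>
</bugrepository>
"""


def parse_bugs(xml_text):
    """Parse bug entries the same way loadBugs2df does, but from a string."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for bug in root.findall('bug'):
        info = bug.find('buginformation')
        rows.append({
            "id": bug.attrib['id'],
            "fixdate": bug.attrib['fixdate'],
            "summary": info.findtext('summary'),
            "description": info.findtext('description'),
            # Same normalization as the loader: '/' -> '.', lower-case
            "fix": [f.text.replace('/', '.').lower() for f in bug.find('fixedFiles')],
        })
    return rows


rows = parse_bugs(SAMPLE_XML)
print(rows[0])
```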


<center><h1>Main Preprocessing class</h1></center> 

In [5]:
class PreprocessingUnit:
    all_projects_source_codes=pd.DataFrame([])
    all_projects_bugreports=pd.DataFrame([])
    
    def __init__(self,dataPath):
        self.dataPath=dataPath
        self.dataFolder=os.path.join(os.getcwd(),'Output')
        if not os.path.exists(self.dataFolder):
            os.makedirs(self.dataFolder)
            
    def execute(self):
        self.loadEverything()

    def loadEverything(self):
        # Resolve both paths up front so the status messages below can always use them
        bugReportFile=os.path.join(self.dataFolder,'allBugReports.pickle')
        sourceCodeFile=os.path.join(self.dataFolder,'allSourceCodes.pickle')

        if PreprocessingUnit.all_projects_bugreports.empty:
            if not os.path.isfile(bugReportFile):
                # No cached pickle yet: parse the XML repositories and cache the result
                PreprocessingUnit.all_projects_bugreports=load_all_BRs(dataPath=self.dataPath)
                PreprocessingUnit.all_projects_bugreports.to_pickle(bugReportFile)
            else: 
                PreprocessingUnit.all_projects_bugreports=pd.read_pickle(bugReportFile)
        print("*** All bug reports are preprocessed and stored as: {} ***".format('/'.join(bugReportFile.split('/')[-2:])))

        if PreprocessingUnit.all_projects_source_codes.empty:
            if not os.path.isfile(sourceCodeFile):
                # No cached pickle yet: read the java files and cache the result
                PreprocessingUnit.all_projects_source_codes=load_all_SCs(dataPath=self.dataPath)
                PreprocessingUnit.all_projects_source_codes.to_pickle(sourceCodeFile)
            else:
                PreprocessingUnit.all_projects_source_codes=pd.read_pickle(sourceCodeFile)
        print("*** All source codes are preprocessed and stored as: {} ***".format('/'.join(sourceCodeFile.split('/')[-2:])))
        

### MAIN

In [6]:
if __name__=="__main__":
    config={'DATA_PATH':os.path.join('data')}
    preprocessor=PreprocessingUnit(dataPath=config['DATA_PATH'])
    preprocessor.execute()

*** All bug reports are preprocessed and stored as: Output/allBugReports.pickle ***
*** All source codes are preprocessed and stored as: Output/allSourceCodes.pickle ***


In [13]:
def loadEverything():
    all_projects_bugreports = pd.read_pickle('Output/allBugReports.pickle')
    print("*** All Bug Reports are Loaded. ***")
    all_projects_source_codes = pd.read_pickle('Output/allSourceCodes.pickle')
    print("*** All Source Codes are Loaded. ***")
    return all_projects_bugreports, all_projects_source_codes

all_projects_bugreports, all_projects_source_codes = loadEverything()
display(all_projects_bugreports.iloc[1000])
display(all_projects_source_codes.iloc[1000])



*** All Bug Reports are Loaded. ***
*** All Source Codes are Loaded. ***


fix                  [org.apache.commons.lang.builder.equalsbuilder...
text                                                               NaN
fixdate                                            2008-01-13 07:00:40
summary              EqualsBuilder don&apos;t compare BigDecimals c...
description          When comparing a BigDecimal, the comparing is ...
project                                                           LANG
average_precision                                                  0.0
Name: 393, dtype: object

filename            main.java.org.springframework.security.authent...
unprocessed_code    /* Copyright 2004, 2005, 2006 Acegi Technology...
project                                                           SEC
Name: 428, dtype: object

# Problem

There are several software engineering (SE) problems that can be investigated using machine learning. Among them, we will be working on a problem called "Fault Localization" (FL). The goal of FL is to automatically locate a faulty entity (e.g. a source file, a class, or a method) in source code. There are different variations of FL, and we will focus on Information Retrieval based FL (IRFL). This article explains FL: https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2530&context=sis_research. 

In short, the idea is that, given a new bug report document, we want to automatically identify the source code file that most likely needs a fix, so we can save debugging time. 

To do this, we may use the previous bug reports and the locations (files) that were patched for them as our training set. So, we build an IRFL model that:

- Finds the textual similarity between the new bug report and the historical ones. 
- Ranks historically patched source files based on how similar their bug reports are to the new bug report.

In [66]:
# Key Imports
import re, string
import pytest

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from scipy.spatial import distance
from collections import defaultdict

# Data Pre-processing

First, we need to create our training set using `all_projects_bugreports` and `all_projects_source_codes`.

We will clean the bug report and source code text by creating a function that:

1. Makes all text lowercase
2. Removes all punctuation from the text
3. Removes all repetitive white space from the text
4. Tokenizes the filtered string, removes stop words, and stems the remaining tokens

Then, we will extract the features and labels of the bug report by:

1. Concatenating the bug summary and description, then using the `clean_text` function to clean the combined text

In [None]:
def clean_text(text):
  """
  Processes the given text by changing all text to lowercase, removing punctuation, collapsing repetitive 
  whitespace, tokenizing words, removing stop words, and finally stemming all words.
  Returns a string containing the stemmed tokens separated by a single space.
  """
  # Change text to lower case
  text = text.lower()

  # Replace punctuation with spaces so that tokens such as "foo.bar" are split rather than fused
  text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

  # Collapse any repetitive whitespace
  text = re.sub(r"\s+", " ", text)

  # Tokenize the words and remove all stop words.
  # `stopwords` is a corpus reader, so we need its English word list
  # (this requires the NLTK tokenizer and stopwords resources to be downloaded).
  stop_words = set(stopwords.words('english'))
  tokens = [token for token in word_tokenize(text) if token not in stop_words]

  # Stem all words using Porter Stemming
  stemmer = PorterStemmer()
  tokens = [stemmer.stem(token) for token in tokens]

  # Recreate the text
  return " ".join(tokens)

def process_bug_reports(bug_reports):
  clean_bug_reports = {
    "fix": [],
    "project": [],
    "text": [],
    "fixdate": []
  }
  # Iterating a DataFrame directly yields column names, so use iterrows() to get rows
  for _, bug_report in bug_reports.iterrows():
    # Concatenate the report's description and summary,
    # keeping only the parts that are real strings (missing values are None/NaN)
    description = bug_report['description']
    summary = bug_report['summary']
    parts = [part for part in (description, summary) if isinstance(part, str)]
    # Join with a space so the last and first words do not fuse together
    bug_text = " ".join(parts)

    # If the bug report is empty, we should not consider it.
    if bug_text.strip() == "":
      continue
    
    bug_text = clean_text(bug_text)

    # Append to the clean_bug_report dictionary
    clean_bug_reports['fix'].append(bug_report['fix'])
    clean_bug_reports['project'].append(bug_report['project'])
    clean_bug_reports['text'].append(bug_text)
    clean_bug_reports['fixdate'].append(bug_report['fixdate'])

  # Put the dict into a dataframe and return
  clean_bug_reports = pd.DataFrame.from_dict(clean_bug_reports)
  return clean_bug_reports

def process_source_files(source_files):
  clean_source_files = {
    "filename": [],
    "code": [],
    "project": []
  }
  # Iterating a DataFrame directly yields column names, so use iterrows() to get rows
  for _, source_file in source_files.iterrows():
    # Clean the source file's code
    clean_code = clean_text(source_file['unprocessed_code'])

    # Append to the clean_source_files dictionary
    # (assigning instead of appending would keep only the last file)
    clean_source_files['filename'].append(source_file['filename'])
    clean_source_files['code'].append(clean_code)
    clean_source_files['project'].append(source_file['project'])

  # Put the dict into a dataframe and return
  clean_source_files = pd.DataFrame.from_dict(clean_source_files)
  return clean_source_files

# Let's process the bug reports and source files

In [None]:
bugs = process_bug_reports(all_projects_bugreports)
source_files = process_source_files(all_projects_source_codes)

# Method 1
- We preprocess the data to have a clean dataset representing source files (including the buggy ones) and the bug reports. The exact preprocessing choices are ours to make.
- Next, apply the TF-IDF method to calculate the similarity between the new bug report (to locate) and the source code files directly. Unlike BugLocator, we ignore the historical bug reports in this step. The similarity function of Method 1 is called the direct relevancy function.
- Finally, we rank the source files based on their textual similarity to the new bug report and present the results using proper evaluation metrics (such as MAP and MRR).
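
The evaluation metrics named in the last step can be sketched as follows. This is a minimal illustration rather than the notebook's evaluation code: `average_precision`, `reciprocal_rank`, and the toy rankings are hypothetical names and data. The sketch assumes each ranking covers every file of the project, so the number of hits equals the number of fixed files that exist in the collection.

```python
def average_precision(ranked_files, fixed_files):
    """Mean of precision@k taken at each rank k where a fixed file appears."""
    fixed = set(fixed_files)
    hits = 0
    precision_sum = 0.0
    for rank, filename in enumerate(ranked_files, start=1):
        if filename in fixed:
            hits += 1
            precision_sum += hits / rank
    # Assumption: the ranking is complete, so we divide by the hits found
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked_files, fixed_files):
    """1 / rank of the first fixed file, or 0.0 if none is ranked."""
    fixed = set(fixed_files)
    for rank, filename in enumerate(ranked_files, start=1):
        if filename in fixed:
            return 1.0 / rank
    return 0.0


# Toy data: one (ranking, fixed files) pair per bug report
results = [
    (["a", "b", "c", "d"], ["b", "d"]),
    (["x", "y"], ["x"]),
]
mean_ap = sum(average_precision(r, f) for r, f in results) / len(results)
mean_rr = sum(reciprocal_rank(r, f) for r, f in results) / len(results)
print("MAP:", mean_ap, "MRR:", mean_rr)
```

MAP rewards placing all fixed files near the top, while MRR only looks at the first correct file, so reporting both gives a fuller picture of ranking quality.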


# Calculating Similarities

Since our dataset has multiple projects, and each project has multiple bug reports, we don't want to compare bug reports with files of a project they don't belong to. So, we will compare a bug report to all the files in its respective project. 

To compute the similarity between a bug report and its project's files:

1. Iterate through each file of the bug report's project.
2. Create a TF-IDF vectorizer and fit and transform the file's source code since we want to compare against the source code.
3. Transform the bug report's text with the vectorizer.
4. Compare the two resulting vectors using cosine similarity.

We will iterate through each bug report and generate its similarities, then return a list of similarity tables, one per bug report, in the same order as the bug reports.

In [72]:
def calculate_similarity(bug_report, source_files):
  """
  Calculates the similarity of the text of a bug report to each of the source files 
  WITHIN THE SAME PROJECT AS THE bug_report.
  Returns a dataframe with one row per file: "scores" and "files".
  """
  similarity = {
    "scores": [],
    "files": []
  }
  # For each file, we will calculate the similarity of its source code to the bug report
  for file in source_files:
    vectorizer = TfidfVectorizer()
    try:
      source_vector = vectorizer.fit_transform([file['code']])
    except ValueError:
      # Raised when the file yields an empty vocabulary (e.g. no usable tokens)
      similarity['scores'].append(0.0)
      similarity['files'].append(file['filename'])
      continue
    bug_vector = vectorizer.transform([bug_report['text']])
    similarity_score = cosine_similarity(source_vector, bug_vector)[0][0]
    similarity['scores'].append(similarity_score)
    similarity['files'].append(file['filename'])

  return pd.DataFrame.from_dict(similarity)

def compute_similarities(bug_reports, source_files):
  # First, let's find the source files for each project
  project_files = defaultdict(list)
  for _, source_file in source_files.iterrows():
    project_files[source_file['project']].append(source_file)

  # Then, let's compute the similarities for each bug with its project's source files
  similarities = []
  for _, bug_report in bug_reports.iterrows():
    print(bug_report['project'])
    similarity = calculate_similarity(bug_report, project_files[bug_report['project']])
    similarities.append(similarity)

  return similarities

# Let's do a test
test_bugs = pd.DataFrame.from_dict({
  "fix": ["x"],
  "project": ["A"],
  "text": ["delta test coordinate"],
  "fixdate": ["a"],
})
test_files = pd.DataFrame.from_dict({
  "filename": ["x", "y"],
  "project": ["A", "A"],
  "code": ["delta test art", "move time coordinate"]
})
sims = compute_similarities(test_bugs, test_files)
print(sims)

A
[   scores files
0     1.0     x
1     0.0     y]
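
One variation worth comparing against the per-file steps above is to fit a single vectorizer on all files of a project. When the vectorizer is fitted on one document at a time, every term has the same IDF, so the weighting reduces to term frequency. Fitting on the whole project lets rare terms count for more. The sketch below is an assumption about how that could look and is not the method used above; `rank_files_for_bug` is a hypothetical name.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_files_for_bug(bug_text, filenames, codes):
    """Rank one project's files against a bug report with a shared TF-IDF space."""
    vectorizer = TfidfVectorizer()
    file_matrix = vectorizer.fit_transform(codes)   # one row per source file
    bug_vector = vectorizer.transform([bug_text])   # same vocabulary and IDF weights
    scores = cosine_similarity(bug_vector, file_matrix)[0]
    ranking = pd.DataFrame({"files": filenames, "scores": scores})
    # Highest similarity first
    return ranking.sort_values("scores", ascending=False).reset_index(drop=True)


# Same toy data as the test above
ranking = rank_files_for_bug(
    "delta test coordinate",
    ["x", "y"],
    ["delta test art", "move time coordinate"],
)
print(ranking)
```

A side benefit is speed: the project is vectorized once per bug report (or once overall, if the fitted matrix is cached) instead of once per file.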


# Method 2

In this step, we will develop a new IRFL method and compare it to Method 1.

We will roughly implement the BugLocator tool. We will reuse the preprocessing and TF-IDF code we developed for Method 1 to calculate an indirect relevancy function. Then, we will use a weighted average of the direct relevancy function and the indirect relevancy function to do the ranking for this method. The indirect function calculates the similarity between the new bug report and the historical ones. Since we already know exactly which files were fixed for each historical bug report, we can map files to historical bug reports. The algorithm then ranks source files according to their indirect similarity to the new bug report, that is, the similarity of a source file's corresponding historical report(s) to the new bug report.

- Method 2 MUST improve on Method 1's results.
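
The indirect relevancy function and the weighted average described above could be sketched as follows. This is a rough outline under stated assumptions, not the BugLocator implementation: the function names, the min-max normalization, the choice to split each historical report's similarity evenly across its fixed files, and the default weight `alpha=0.2` are all illustrative choices that would need tuning on this dataset.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def indirect_scores(new_bug_text, past_bug_texts, past_bug_fixes):
    """Score files by how similar their historical bug reports are to the new one."""
    vectorizer = TfidfVectorizer()
    past_matrix = vectorizer.fit_transform(past_bug_texts)
    new_vector = vectorizer.transform([new_bug_text])
    sims = cosine_similarity(new_vector, past_matrix)[0]
    scores = defaultdict(float)
    for sim, fixes in zip(sims, past_bug_fixes):
        for filename in fixes:
            # Assumption: a report's similarity is shared evenly by its fixed files
            scores[filename] += sim / len(fixes)
    return dict(scores)


def min_max(scores):
    """Rescale scores to [0, 1]; all zeros if every score is identical."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {name: 0.0 for name in scores}
    return {name: (value - low) / (high - low) for name, value in scores.items()}


def combine(direct, indirect, alpha=0.2):
    """Weighted average of normalized direct and indirect scores (alpha is hypothetical)."""
    # Files without any history get an indirect score of 0
    indirect_full = {name: indirect.get(name, 0.0) for name in direct}
    direct_n, indirect_n = min_max(direct), min_max(indirect_full)
    return {name: (1 - alpha) * direct_n[name] + alpha * indirect_n[name]
            for name in direct}


# Toy data: two historical reports and made-up direct scores from Method 1
history = indirect_scores(
    "delta test coordinate",
    ["delta test crash", "render time slow"],
    [["x"], ["y", "z"]],
)
final = combine({"x": 0.1, "y": 0.5, "z": 0.3}, history, alpha=0.2)
print(sorted(final.items(), key=lambda item: item[1], reverse=True))
```

To avoid leaking information, only reports whose `fixdate` precedes the new bug report should be passed in as history.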
 

# Method 3

This is our brand new FL technique, applicable to this dataset. The novel approach should use a machine learning/information retrieval method that is not taught in class. It is okay if the method has already been proposed and published in the FL literature; however, the code cannot be copy-pasted. This method does not need to outperform the other methods.