## Github Repository Process Metrics Extraction and Matching with Product Metrics Dataset

Data Sources: 
- [Bavota et al. (2016)](https://figshare.com/articles/An_Experimental_Investigation_on_the_Innate_Relationship_between_Quality_and_Refactoring/1207916)
- [Apache Ant Mirror Repository](https://github.com/apache/ant)

Paper References: 
- Tanaka D., Choi E., Yoshida N., Fujiwara K., Port D., Iida H. (20xx). An Investigation of the Relationship Between Extract Method and Process Metrics. The Institute of Electronics, Information and Communication Engineers.
- Kumar, L., & Sureka, A. (2017). Application of LSSVM and SMOTE on Seven Open Source Projects for Predicting Refactoring at Class Level. Asia-Pacific Software Engineering Conference (APSEC 2017), 90–99. https://doi.org/10.1109/APSEC.2017.15
- Bavota, G., De Lucia, A., Di Penta, M., Oliveto, R., & Palomba, F. (2015). An experimental investigation on the innate relationship between quality and refactoring. Journal of Systems and Software, 107, 1–14. https://doi.org/10.1016/j.jss.2015.05.024
- Lee, T., Nam, J., Han, D., Kim, S., & In, H. P. (2011). Micro Interaction Metrics for Defect Prediction. Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 5589(c), 311–321. https://doi.org/10.1145/2025113.2025156


In [1]:
import os
import pandas as pd
import subprocess

import seaborn as sns
import matplotlib.pyplot as plt
%pylab inline
from pylab import rcParams
rcParams['figure.figsize'] = 10,5

Populating the interactive namespace from numpy and matplotlib


In [4]:
# Data directories

data_dir = "data/raw/badsmells/data/"
projects = ["apache-ant", "xerces2-j"]

ddata_dir = "data/transformed/"

In [5]:
# Loading dataset and dropping duplicates
prod_df = pd.DataFrame()
for p in projects:
    p_dir = data_dir + p + "/" + p + "/"
    for f in os.listdir(p_dir):
        if f.endswith("metrics.csv"):
            df = pd.read_csv(p_dir + f, sep=";", index_col=False)
            df["proj"] = p
            prod_df = prod_df.append(df)
prod_df.drop_duplicates(inplace=True)

In [6]:
# Standardization of version names
dversions = []
for ver in prod_df["Version"]:
    dver = ""
    for i in ver.split("."):
        if len(i) == 2 and i[0] == "0":
            dver = dver + i[1] + "."
            continue
        dver = dver + i + "."
    dversions.append(dver.rstrip("."))
prod_df["Version"] = dversions
prod_df["Version"] = prod_df["Version"].replace(to_replace="1.8.0final", value="1.8.0")
prod_df["Version"].value_counts()

print("Available versions: {}".format(prod_df["Version"].unique()))
print("nVersions: {}".format(len(prod_df["Version"].unique())))

Available versions: ['1.1' '1.2' '1.3' '1.4' '1.4.1' '1.5' '1.5.1' '1.5.4' '1.6.0' '1.6.1'
 '1.6.2' '1.6.3' '1.6.4' '1.7.0' '1.7.1' '1.8.0' '1.8.1' '1.8.2' '1.0.0'
 '1.0.4' '1.2.0' '1.2.1' '1.2.2' '1.2.3' '1.3.0' '1.3.1' '1.4.0' '1.4.2'
 '1.4.3' '1.4.4' '2.0.0' '2.0.0alpha' '2.0.0beta' '2.0.0beta2' '2.0.0beta3'
 '2.0.0beta4' '2.0.1' '2.0.2' '2.1.0' '2.2.0' '2.2.1' '2.3.0' '2.4.0'
 '2.5.0' '2.6.0' '2.6.1' '2.6.2' '2.7.0' '2.7.1' '2.8.0' '2.8.1' '2.9.0']
nVersions: 52


In [7]:
prod_df["Version"].value_counts()

2.6.2         915
1.4.1         884
2.0.1         881
2.2.1         868
1.7.1         843
2.3.0         834
1.8.2         825
1.8.0         818
1.7.0         816
1.8.1         815
2.5.0         805
2.7.1         776
2.4.0         771
1.6.4         770
2.2.0         764
2.8.1         756
2.8.0         747
2.6.0         741
2.0.2         717
2.9.0         716
2.1.0         713
2.7.0         710
2.6.1         706
2.0.0beta3    704
2.0.0beta4    701
1.6.2         690
1.6.3         672
1.6.1         645
1.6.0         633
2.0.0beta     619
1.0.4         595
1.5.4         583
1.2.3         536
1.3.0         536
1.5.1         523
1.4.3         520
1.5           516
1.4.4         506
1.4.0         499
1.4.2         497
2.0.0beta2    490
1.3.1         488
1.2.0         487
1.2.1         483
1.2.2         467
2.0.0alpha    385
1.4           327
1.0.0         323
1.3           247
2.0.0         234
1.2           150
1.1           105
Name: Version, dtype: int64

In [8]:
xer_sel_versions = ['2.0.1', '2.0.2', '2.1.0', '2.2.0', '2.2.1', '2.3.0', '2.4.0', '2.5.0', '2.6.0', '2.6.1', '2.6.2', '2.7.0', '2.7.1']
ant_sel_versions = ['1.5', '1.5.1', '1.5.4', '1.6.0', '1.6.1', '1.6.2', '1.6.3', '1.6.4', '1.7.0', '1.7.1', '1.8.0', '1.8.1', '1.8.2']

### Process Metrics Extraction
Will only consider versions 1.5 onwards for stability. Revision split points
- 1:5
- 2:4
- 3:3
- 4:2
- 5:1

In [28]:
proj_df["Class"].unique()

array(['org.apache.xerces.dom.AttrImpl',
       'org.apache.xerces.dom.CharacterDataImpl',
       'org.apache.xerces.dom.DeepNodeListImpl', ...,
       'org.apache.xerces.impl.io.Latin1Reader',
       'org.apache.xerces.parsers.SoftReferenceSymbolTableConfiguration',
       'org.apache.xerces.util.SoftReferenceSymbolTable'], dtype=object)

In [36]:
%%time
proc_proj_df = pd.DataFrame()

for proj in projects:
    os.chdir("ghrepos/{}/".format(proj))
    
    proj_df = prod_df[prod_df["proj"] == proj]
    
    if proj == "xerces2-j":
        sel_versions = xer_sel_versions
        ver_prepend = "Xerces-J_"
        starts_w = ("dom", "dom3", "javax", "jaxp", "sax", "socket", "simplety", "ui", "util", "xni", "xs", "org")
        
    else:
        sel_versions = ant_sel_versions
        ver_prepend = "rel/"
        starts_w = ("main", "org")
        
    split_points = range(0, len(sel_versions), 2)
    proc_period = []
    ref_period = []
    for i in range(1, 6):
        proc_period.append(sel_versions[:split_points[i]+1])
        ref_period.append(sel_versions[split_points[i]+1:])

    info_cols = ['Refactoring', 'Version', 'Class']
    for split in range(len(proc_period)):
        print("===== SPLIT {} =====".format(split))
        # Get the start and end of process metrics period versions
        start_proc, end_proc = proc_period[split][0], proc_period[split][-1]
        if proj == "xerces2-j":
            start_proc, end_proc = start_proc.replace(".", "_"), end_proc.replace(".", "_")
        start_comm = subprocess.check_output('git rev-list -n 1 {}{}'.format(ver_prepend, start_proc), shell=True)[:8].decode("utf-8")
        end_comm = subprocess.check_output('git rev-list -n 1 {}{}'.format(ver_prepend, end_proc), shell=True)[:8].decode("utf-8")

        # Get the rows with version inside the ref period
        met_df = proj_df[proj_df["Version"].isin(proc_period[split])][info_cols]
        ref_df = proj_df[proj_df["Version"].isin(ref_period[split])]

        # For NC (Number of changes)
        print("==> Processing NC..")
        output = subprocess.check_output("git diff --name-only {}{} {}{}".format(ver_prepend, start_proc, ver_prepend, end_proc), shell=True)
        
        diff_classes_raw = [i for i in str(output).split("\\n") if i.endswith(".java") and i.lstrip("src/").startswith(starts_w)]
        diff_classes = [i.lstrip("src/").rstrip(".java").replace("/",".") for i in diff_classes_raw]
        
        diff_classes = set(diff_classes) & set(met_df["Class"].values)
        met_df["NC"] = met_df["Class"].apply(lambda x: 1 if x in diff_classes else 0)

        # For NDC (Number of Distinct Committers)  
        # For AG (Age of revision)
        print("==========================")
        print("==> Processing AG and NDC..")
        
        met_df["AG"] = 0
        met_df["NDC"] = 0
        for i in met_df["Class"].unique():
            class_str_name = "src/" + i.replace(".","/") + ".java"
            rth_commit = subprocess.check_output("git log {} --format=%ct {} | head -n1".format(end_comm, class_str_name), shell=True)
            first_commit = subprocess.check_output("git log --format=%ct {} | tail -1".format(class_str_name), shell=True)
            
            if first_commit != b'' and rth_commit != b'':
                dt = int(rth_commit) - int(first_commit)
                met_df.loc[met_df.Class==i, "AG"] = dt
            authors = subprocess.check_output("git shortlog -s {}{}...{}{} -- {}".format(ver_prepend, start_proc, ver_prepend, end_proc, class_str_name), shell=True)
            if authors != "b''":
                n_authors = len([i.strip().split("\t") for i in authors.decode("utf-8").strip("\n").split("\n") if len(i)>0])
                met_df.loc[met_df.Class==i, "NDC"] = n_authors

        # For ADD and DEL
        print("==> Processing ADD & DEL..")
        output = subprocess.check_output("git diff --numstat {} {} | grep -E '*.java'".format(start_comm, end_comm), shell=True)
        add_del_df = pd.DataFrame([entry.split("\\t") for entry in str(output).lstrip("b'").split("\\n")], columns=["ADD", "DEL", "Class"])
        add_del_df.dropna(axis=0, subset=["Class"], inplace=True)

        add_del_df = add_del_df[(add_del_df["Class"].str.startswith(tuple(["src/" + i for i in starts_w])))]
        add_del_df["Class"] = add_del_df["Class"].apply(lambda x: x.lstrip("src/").rstrip(".java").replace("/", "."))
        met_df = pd.merge(met_df, add_del_df, how="left", on="Class")

        # For CHURN
        print("==> Processing CHURN..")
        met_df["CHURN"] = met_df["ADD"] + met_df["DEL"]

        met_df["nsplit"] = split + 1

        new_df = pd.merge(met_df, ref_df, on=['Class', 'Refactoring'])
        proc_proj_df = proc_proj_df.append(new_df)

    os.chdir("../../")

===== SPLIT 0 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 1 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 2 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 3 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 4 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 0 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 1 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 2 =====
==> Processing NC..
==> Processing AG and NDC..
==> Processing ADD & DEL..
==> Processing CHURN..
===== SPLIT 3 =====
==> Processing NC..
==> Processing A

In [30]:
met_df

Unnamed: 0,Refactoring,Version,Class,NC,AG,NDC,ADD,DEL,CHURN,nsplit
0,not,2.0.1,org.apache.xerces.dom.AttributeMap,1,117753063,6,199,107,199107,5
1,not,2.0.1,org.apache.xerces.dom.AttrImpl,1,134799826,4,172,56,17256,5
2,not,2.0.1,org.apache.xerces.dom.AttrNSImpl,1,129091654,5,167,67,16767,5
3,not,2.0.1,org.apache.xerces.dom.CoreDocumentImpl,1,79595793,8,911,197,911197,5
4,not,2.0.1,org.apache.xerces.dom.DeferredAttrImpl,1,100647773,2,5,1,51,5
5,not,2.0.1,org.apache.xerces.dom.DeferredAttrNSImpl,1,94939601,3,8,24,824,5
6,not,2.0.1,org.apache.xerces.dom.DeferredDocumentImpl,1,126653784,3,276,85,27685,5
7,not,2.0.1,org.apache.xerces.dom.DocumentImpl,1,117393193,3,76,166,76166,5
8,not,2.0.1,org.apache.xerces.dom.DOMErrorImpl,1,45368745,2,28,42,2842,5
9,not,2.0.1,org.apache.xerces.dom.DOMImplementationImpl,1,133300140,5,51,152,51152,5


In [31]:
add_del_df

Unnamed: 0,ADD,DEL,Class
27,275,0,dom3.org.w3c.dom.Attr
28,54,0,dom3.org.w3c.dom.CDATASection
29,153,0,dom3.org.w3c.dom.CharacterDat
30,30,0,dom3.org.w3c.dom.Comment
31,414,0,dom3.org.w3c.dom.DOMConfiguration
32,86,0,dom3.org.w3c.dom.DOMError
33,45,0,dom3.org.w3c.dom.DOMErrorHandler
34,131,0,dom3.org.w3c.dom.DOMException
35,136,0,dom3.org.w3c.dom.DOMImplementation
36,43,0,dom3.org.w3c.dom.DOMImplementationList


In [32]:
xerces = proc_proj_df.dropna().drop_duplicates()

In [35]:
xerces["NC"].value_counts()

1    51533
Name: NC, dtype: int64

In [26]:
proc_proj_df

Unnamed: 0,Refactoring,Version_x,Class,NC,AG,NDC,ADD,DEL,CHURN,nsplit,...,NOC,RFC,CBO,LCOM,NOM,NOA,NOO,CCBC,C3,proj
0,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,110,3,105,15,8,7,27.912213,0.401079,xerces2-j
1,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,110,3,105,15,8,7,27.816749,0.401079,xerces2-j
2,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,110,3,105,15,8,7,28.103010,0.401079,xerces2-j
3,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,28.474890,0.400960,xerces2-j
4,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,28.408250,0.402277,xerces2-j
5,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,28.975578,0.402176,xerces2-j
6,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,29.033282,0.402176,xerces2-j
7,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,29.503330,0.402176,xerces2-j
8,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,29.927837,0.401113,xerces2-j
9,not,2.0.1,org.apache.xerces.dom.AttributeMap,0,69686388,3,,,,1,...,0,113,3,105,15,8,7,29.926663,0.401113,xerces2-j


In [37]:
proc_proj_df["proj"].value_counts()

xerces2-j     95512
apache-ant    87939
Name: proj, dtype: int64

In [89]:
os.chdir("../../")

In [38]:
proc_proj_df.to_csv(ddata_dir+"proc_prod.csv", index=False)

In [42]:
proc_proj_df[proc_proj_df["proj"]=="apache-ant"].to_csv(ddata_dir+"apache.csv", index=False)