### Data Loading

In [1]:
import pandas as pd
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
pd.options.display.max_colwidth = 50
import numpy as np 

In [2]:
!pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()



In [3]:
git_commit = spark.read.csv('../data/GIT_COMMITS.csv',header=True)
changes = spark.read.csv('../data/GIT_COMMITS_CHANGES.csv',header=True)
ref_miner = spark.read.csv('../data/REFACTORING_MINER.csv',header=True)
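
The `show()` output for GIT_COMMITS further down has rows where an author name sits in the COMMIT_HASH column. One plausible cause is that commit messages contain line breaks, and a line-based CSV reader splits such a record in two. The sketch below illustrates the effect with the standard library on a made-up two-commit sample. It also shows a hypothetical helper, `read_commits_multiline`, that passes Spark's `multiLine` and `escape` options. Whether these options match the quoting used in the actual files is an assumption that would need checking.

```python
import csv
import io

# Hypothetical sample: the second commit message spans several lines inside quotes.
sample = (
    'COMMIT_HASH,COMMIT_MESSAGE\n'
    'abc123,"Initial revision"\n'
    'def456,"Update\n\nAdded question list"\n'
)

# Naive line splitting treats every newline as a record boundary.
naive_rows = sample.strip().split("\n")[1:]

# A quote-aware parser keeps the multi-line message in a single record.
parsed_rows = list(csv.reader(io.StringIO(sample)))[1:]

print(len(naive_rows))   # 4 line fragments
print(len(parsed_rows))  # 2 commits


def read_commits_multiline(spark, path):
    """Read a CSV whose quoted fields may span lines (assumed file layout)."""
    # multiLine lets a quoted field contain newlines; escape='"' assumes
    # that embedded quotes are written as doubled quotes.
    return spark.read.csv(path, header=True, multiLine=True, escape='"')
```

If the row count drops after re-reading with these options, that would suggest some of the 193,152 rows are fragments of multi-line records rather than separate commits.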

### GIT_COMMITS

This table reports the commit information retrieved from the git log. 193,152 rows.

1. COMMIT_HASH str: (PK) Hash that identifies the commit. We use it to look up information about the commit in the other tables.

2. COMMIT_MESSAGE str: Message summarizing the changes introduced with the commit. This is what we will try to predict with our model.

3. PARENTS str: Hashes of the parent commit(s). Information from previous commits may help in generating a message for the commit, since it carries hints about style or content.

In [4]:
git_commit.printSchema()

root
 |-- PROJECT_ID: string (nullable = true)
 |-- COMMIT_HASH: string (nullable = true)
 |-- COMMIT_MESSAGE: string (nullable = true)
 |-- AUTHOR: string (nullable = true)
 |-- AUTHOR_DATE: string (nullable = true)
 |-- AUTHOR_TIMEZONE: string (nullable = true)
 |-- COMMITTER: string (nullable = true)
 |-- COMMITER_DATE: string (nullable = true)
 |-- COMMITTER_TIMEZONE: string (nullable = true)
 |-- BRANCHES: string (nullable = true)
 |-- IN_MAIN_BRANCH: string (nullable = true)
 |-- MERGE: string (nullable = true)
 |-- PARENTS: string (nullable = true)



In [5]:
# Drop the columns that are not described above.
# Note: the schema spells the date column 'COMMITER_DATE' (one T), so
# 'COMMITTER_DATE' matches nothing; drop() ignores it silently and the
# column is still present in the outputs below.
git_commit = git_commit.drop(
    'PROJECT_ID', 'AUTHOR', 'AUTHOR_DATE', 'AUTHOR_TIMEZONE',
    'COMMITTER', 'COMMITTER_DATE', 'COMMITTER_TIMEZONE',
    'BRANCHES', 'IN_MAIN_BRANCH', 'MERGE',
)

In [6]:
git_commit.show()

+--------------------+--------------------+-------------+-------+
|         COMMIT_HASH|      COMMIT_MESSAGE|COMMITER_DATE|PARENTS|
+--------------------+--------------------+-------------+-------+
|52fc76012c5f96914...|New repository in...|         null|   null|
|           No Author|2000-10-01T07:37:01Z|    ['trunk']|   null|
|b1ff4af6abfec32fc...|    Initial revision|         null|   null|
|James Duncan Davi...|2000-10-01T07:37:01Z|    ['trunk']|   null|
|c8d7a13470987f892...|              Update|         null|   null|
|                null|                null|         null|   null|
|James Duncan Davi...|2000-10-01T07:40:39Z|    ['trunk']|   null|
|93a16402b48ae1cf7...|Added question li...|         null|   null|
|                null|                null|         null|   null|
|James Duncan Davi...|2000-10-01T08:15:04Z|    ['trunk']|   null|
|fcaecb541edc03f36...|      testing commit|         null|   null|
|        Dean Jackson|2000-10-02T13:33:11Z|    ['trunk']|   null|
|2ecc354fa

In [7]:
git_commit.count()

193152

In [8]:
git_commit.summary().show()

+-------+------------------+-----------------+--------------------+--------------------+
|summary|       COMMIT_HASH|   COMMIT_MESSAGE|       COMMITER_DATE|             PARENTS|
+-------+------------------+-----------------+--------------------+--------------------+
|  count|            158353|           154523|               81075|                8357|
|   mean|17309.403846153848|9555.553846153847|   7542.857142857143|                null|
| stddev|14023.909084145711|13707.12340668259|  13798.136520145445|                null|
|    min|                  |                 | Client B should ...|['000b111922ccb9c...|
|    25%|               3.0|              4.0|             -3600.0|                null|
|    50%|           18548.0|             10.0|             -3600.0|                null|
|    75%|           30580.0|          18640.0|             21600.0|                null|
|    max|            이종현|      zoom icons.|org.apache.felix....|                  []|
+-------+---------------
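
The summary above reports 158,353 non-null COMMIT_HASH values out of 193,152 rows, and its `max` is not a hash at all. A simple sanity check is to keep only rows whose key looks like a git hash. The sketch below assumes hashes are 40-character lowercase hexadecimal SHA-1 strings. `looks_like_commit_hash` and `keep_valid_hashes` are hypothetical helper names, and the example values are illustrative.

```python
import re

# Assumption: commit hashes are 40 lowercase hex characters (SHA-1).
SHA1_PATTERN = r"^[0-9a-f]{40}$"


def looks_like_commit_hash(value):
    """Return True if value is a string shaped like a SHA-1 hash."""
    return isinstance(value, str) and re.fullmatch(SHA1_PATTERN, value) is not None


def keep_valid_hashes(df, column="COMMIT_HASH"):
    """Spark counterpart: keep rows whose key column matches the pattern."""
    return df.filter(df[column].rlike(SHA1_PATTERN))


print(looks_like_commit_hash("a" * 40))     # True
print(looks_like_commit_hash("No Author"))  # False
print(looks_like_commit_hash(None))         # False
```

Comparing `keep_valid_hashes(git_commit).count()` with the total row count would show how many rows carry a usable key.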

### GIT_COMMITS_CHANGES

This table contains the changes performed on each commit. 890,223 rows.

1. COMMIT_HASH str: (FK) Hash that identifies the commit. Used to join with other tables.

2. LINES_ADDED int: Number of new lines added in the commit. Predictive variable. The mean number of lines added is $39.10$, with a standard deviation of $331.83$.

3. LINES_REMOVED int: Number of lines removed in the commit. Predictive variable. The mean number of lines removed is $25.04$, with a standard deviation of $276.42$.

4. FILE str: The full path to the modified file. Predictive variable.

In [9]:
changes.printSchema()

root
 |-- PROJECT_ID: string (nullable = true)
 |-- FILE: string (nullable = true)
 |-- COMMIT_HASH: string (nullable = true)
 |-- DATE: string (nullable = true)
 |-- COMMITTER_ID: string (nullable = true)
 |-- LINES_ADDED: string (nullable = true)
 |-- LINES_REMOVED: string (nullable = true)
 |-- NOTE: string (nullable = true)



In [10]:
changes = changes.drop('PROJECT_ID', 'DATE', 'COMMITTER_ID', 'NOTE')

In [11]:
changes.show()

+--------------------+--------------------+-----------+-------------+
|                FILE|         COMMIT_HASH|LINES_ADDED|LINES_REMOVED|
+--------------------+--------------------+-----------+-------------+
|              README|b1ff4af6abfec32fc...|          2|            0|
|              README|c8d7a13470987f892...|          4|            1|
|              README|93a16402b48ae1cf7...|          3|            0|
|              README|fcaecb541edc03f36...|          1|            0|
|              README|2ecc354fa4f3209ad...|          0|            2|
|sources/org/apach...|49e6a3ce306eedc07...|         43|            0|
|sources/org/apach...|49e6a3ce306eedc07...|        147|            0|
|sources/org/apach...|49e6a3ce306eedc07...|        163|            0|
|sources/org/apach...|49e6a3ce306eedc07...|        765|            0|
|sources/org/apach...|49e6a3ce306eedc07...|         18|            0|
|sources/org/apach...|49e6a3ce306eedc07...|        141|            0|
|sources/org/apach..

In [12]:
changes.summary().show()

+-------+--------------------+--------------------+------------------+------------------+
|summary|                FILE|         COMMIT_HASH|       LINES_ADDED|     LINES_REMOVED|
+-------+--------------------+--------------------+------------------+------------------+
|  count|              857742|              857742|            857742|            857742|
|   mean|                null|                null| 39.10055587810787|25.042489466529563|
| stddev|                null|                null|331.82663785721223| 276.4222138288266|
|    min|         ,travis.yml|00016b9ca1063feea...|                 0|                 0|
|    25%|                null|                null|               1.0|               0.0|
|    50%|                null|                null|               3.0|               1.0|
|    75%|                null|                null|              18.0|               7.0|
|    max|zookeeper/zookeep...|fffffc63bcf57852c...|              9999|               999|
+-------+-
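
The schema lists LINES_ADDED and LINES_REMOVED as strings, and the `max` values in the summary above (9999 and 999) are what a lexicographic comparison would likely produce. The sketch below shows the effect on a few illustrative values and a hypothetical helper, `cast_line_counts`, that casts both columns to integers before computing statistics.

```python
# Illustrative values only: as strings, "999" sorts above "1000".
as_strings = ["18", "765", "999", "1000"]

print(max(as_strings))                  # 999  (string comparison)
print(max(int(v) for v in as_strings))  # 1000 (numeric comparison)


def cast_line_counts(df, columns=("LINES_ADDED", "LINES_REMOVED")):
    """Return a DataFrame with the given string columns cast to int."""
    for name in columns:
        # Values that cannot be parsed as integers become null after the cast.
        df = df.withColumn(name, df[name].cast("int"))
    return df
```

After `changes = cast_line_counts(changes)`, the `min` and `max` rows of `changes.summary()` should reflect numeric order rather than string order.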

In [13]:
changes.count()

890223

### REFACTORING_MINER

This table reports the list of refactoring activities applied in the studied repositories. 37,226 rows.

1. COMMIT_HASH str: (FK) Hash that identifies the commit. Used to join with other tables.

2. REFACTORING_TYPE str: One of the 15 types of refactoring detectable by Refactoring Miner. Predictive variable.

3. REFACTORING_DETAIL str: Short description of the refactoring. Predictive variable.

In [14]:
ref_miner.printSchema()

root
 |-- PROJECT_ID: string (nullable = true)
 |-- COMMIT_HASH: string (nullable = true)
 |-- REFACTORING_TYPE: string (nullable = true)
 |-- REFACTORING_DETAIL: string (nullable = true)



In [15]:
ref_miner = ref_miner.drop('PROJECT_ID')

In [16]:
ref_miner.show()

+--------------------+------------------+--------------------+
|         COMMIT_HASH|  REFACTORING_TYPE|  REFACTORING_DETAIL|
+--------------------+------------------+--------------------+
|adbabd6f8adad3f9d...|        Move Class|Move Class	org.ap...|
|23df647cf944b6c33...|        Move Class|Move Class	org.w3...|
|23df647cf944b6c33...|        Move Class|Move Class	org.w3...|
|23df647cf944b6c33...|        Move Class|Move Class	org.w3...|
|23df647cf944b6c33...|        Move Class|Move Class	org.w3...|
|0a2576bbf0225626c...|        Move Class|Move Class	org.w3...|
|0a2576bbf0225626c...|        Move Class|Move Class	org.w3...|
|0a2576bbf0225626c...|        Move Class|Move Class	org.w3...|
|9def38fe5050ce58c...|        Move Class|Move Class	org.ap...|
|9def38fe5050ce58c...| Pull Up Attribute|Pull Up Attribute...|
|9def38fe5050ce58c...|Extract Superclass|Extract Superclas...|
|9def38fe5050ce58c...|    Extract Method|Extract Method	pr...|
|9def38fe5050ce58c...|    Extract Method|Extract Method

In [17]:
ref_miner.count()

37226
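
All three tables share COMMIT_HASH, which the descriptions above name as the join key. A minimal sketch of how they could be combined follows. The helper names and the `N_REFACTORINGS` column are hypothetical, and the choice of an inner join for messages and a left join for refactorings is an assumption, not something this notebook specifies.

```python
def refactorings_per_commit(ref_miner):
    """Count refactorings per commit (one row per COMMIT_HASH)."""
    return (
        ref_miner.groupBy("COMMIT_HASH")
        .count()
        .withColumnRenamed("count", "N_REFACTORINGS")
    )


def join_tables(git_commit, changes, ref_miner):
    """Attach commit messages and refactoring counts to each file change."""
    per_commit = refactorings_per_commit(ref_miner)
    return (
        # Inner join: keep only changes whose commit has a message row.
        changes.join(git_commit, on="COMMIT_HASH", how="inner")
        # Left join: commits without refactorings get a null count.
        .join(per_commit, on="COMMIT_HASH", how="left")
    )
```

Usage would be `dataset = join_tables(git_commit, changes, ref_miner)`. Aggregating the refactorings first avoids multiplying change rows by the number of refactorings in the same commit.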