# Data Extraction

This notebook contains all the steps necessary to fetch all bugs from the Apache Hive Jira online repository (available at [this address](https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues)). We will begin by downloading the data to this repository before filtering it. 

## 1. Fetching Bug Reports from Jira

The first step is to download all the bug reports for *Hive 2.0.0* and subsequent versions. [On this page](https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues), we can select `Advanced Search` and enter the following query:
```sql
project = HIVE AND issuetype = Bug AND status in (Resolved, Closed) AND affectedVersion = X.Y.Z
```
to fetch the bugs affecting a specific version, replacing `X.Y.Z` with the version number. Bug reports for major and minor versions, as well as patches, can be downloaded. All of the bug reports are kept in the `Jira_bug_data` folder of this repository.
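
Since the same query is repeated for every release, the query strings can also be generated rather than typed by hand. The sketch below is only an illustration: the helper name `build_jql` and the short version list are assumptions, not part of this notebook.

```python
def build_jql(version: str) -> str:
    """Return the Jira advanced-search query for one affected version."""
    return (
        "project = HIVE AND issuetype = Bug "
        "AND status in (Resolved, Closed) "
        f"AND affectedVersion = {version}"
    )


# Hypothetical sample of versions; substitute the releases you need.
for version in ["2.0.0", "2.0.1", "2.1.0"]:
    print(build_jql(version))
```

Each printed line can then be pasted into the `Advanced Search` box as is.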

## 2. Removing Redundant Bugs & Concatenating the Data 
Since a given bug may affect more than one version of the software, the same bug can appear in several of the downloaded files. We do not want to remove these repeats, however, because we will look up the affected files for a bug in each version it affects. We will therefore use pandas data frames to load the bugs from every file, tag each bug with the version number of its file, and concatenate everything into a single file.

In [1]:
import pandas as pd
import os
import re
from pathlib import Path

To simplify directory changes, we'll initialize two variables containing the path of this repository and the path of your clone of the Apache Hive repository.

In [2]:
project_repo = "/home/nicolas-richard/Desktop/.Apache_Hive_Bug_Prediction_ML_Model/"
hive_repo = "/home/nicolas-richard/Desktop/.Apache_Hive/"

In [3]:
project_repo = Path(project_repo)
data_dir = project_repo.joinpath("Jira_bug_data")

version_pattern = re.compile(r'_(\d+\.\d+\.\d+)_')

bug_dfs = []

for file_path in data_dir.glob("*.csv"):
    
    df = pd.read_csv(file_path)

    filename = file_path.name  # e.g., 'Hive_3.3.0_Jira_Bug_Data.csv'

    match = version_pattern.search(filename)
    version = match.group(1) if match else 'Unknown'
    
    df['Version'] = version

    df = df[['Issue key', 'Version']]
   
    bug_dfs.append(df)

concatenated_bug_dfs = pd.concat(bug_dfs, ignore_index=True)

concatenated_bug_dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Issue key  2133 non-null   object
 1   Version    2133 non-null   object
dtypes: object(2)
memory usage: 33.5+ KB
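
To make the filename parsing concrete, here is a small standalone check of the same regular expression. The helper name `extract_version` and every filename except `Hive_3.3.0_Jira_Bug_Data.csv` are assumptions made for illustration.

```python
import re

# Same pattern as in the cell above: a three-part version between underscores.
version_pattern = re.compile(r'_(\d+\.\d+\.\d+)_')


def extract_version(filename: str) -> str:
    """Return the X.Y.Z version embedded in a filename, or 'Unknown'."""
    match = version_pattern.search(filename)
    return match.group(1) if match else 'Unknown'


print(extract_version('Hive_3.3.0_Jira_Bug_Data.csv'))   # 3.3.0
print(extract_version('Hive_2.3.10_Jira_Bug_Data.csv'))  # 2.3.10
print(extract_version('Hive_3.0_Jira_Bug_Data.csv'))     # Unknown (only two parts)
```

Note that a file whose name carries only a two-part version would be labelled `Unknown`, so it is worth checking the `Version` column for that value after loading.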


We have gathered 2133 rows, one per bug and affected version. We will run a simple check to remove duplicate rows.

In [4]:
combined_bug_dfs = concatenated_bug_dfs.drop_duplicates()

combined_bug_dfs.info()

combined_bug_dfs.to_csv("Hive_Jira_Bug_Data.csv", index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Issue key  2133 non-null   object
 1   Version    2133 non-null   object
dtypes: object(2)
memory usage: 33.5+ KB


Despite the deduplication above, some bug keys still appear more than once, which is intentional: those bugs affect more than one version of the software. The line of code below counts how many versions each bug affects.

In [5]:
print(concatenated_bug_dfs['Issue key'].value_counts())

Issue key
HIVE-21873    6
HIVE-21009    6
HIVE-18284    5
HIVE-19101    5
HIVE-17429    5
             ..
HIVE-22360    1
HIVE-22375    1
HIVE-22514    1
HIVE-22537    1
HIVE-22124    1
Name: count, Length: 1771, dtype: int64
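
The distinction between exact duplicate rows and bugs that legitimately span several versions can be illustrated without pandas. This is a minimal sketch on made-up rows (the `BUG-1` and `BUG-2` keys are placeholders, not real Hive issues).

```python
from collections import Counter

# Hypothetical (issue key, version) rows standing in for the real data.
rows = [
    ("BUG-1", "2.3.2"),
    ("BUG-1", "2.3.3"),
    ("BUG-2", "3.1.0"),
    ("BUG-1", "2.3.2"),  # exact duplicate of the first row
]

# Drop exact duplicate rows while preserving order.
unique_rows = list(dict.fromkeys(rows))

# Count how many distinct versions each bug appears under.
counts = Counter(key for key, _ in unique_rows)
multi_version = {key: n for key, n in counts.items() if n > 1}

print(len(unique_rows))  # 3
print(multi_version)     # {'BUG-1': 2}
```

With the data frame used in this notebook, the same filter amounts to keeping only the entries of `value_counts()` that are greater than 1.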


## 3. Identifying Affected Java and C++ Files

The goal here is, for every bug ID collected above, to identify the Java and C++ files modified by the commits that mention it. Make sure you have a clone of the Apache Hive project on your computer and its path at hand.

In [6]:
import os
import csv
import subprocess

In [7]:
os.chdir(hive_repo)
subprocess.run(['git', 'checkout', "master"], check=True)
subprocess.run(['git', 'pull'], check=True)

bug_file_map = {}

with open(project_repo.joinpath("Hive_Jira_Bug_Data.csv"), 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the 'Issue key,Version' header row
    for bug_id, version in reader:
        # Keyed by bug ID, so a bug listed under several versions keeps only the last version read
        bug_file_map[bug_id] = {'version': version, 'files': []}

for bug_id in bug_file_map:
    version = bug_file_map[bug_id]['version']
    print(f"Searching for bug ID: {bug_id} (Version: {version})")

    try:
        # List the files touched by every commit whose message mentions the bug ID
        result = subprocess.run(
            ["git", "log", "--grep=" + bug_id, "--name-only", "--pretty=format:"],
            capture_output=True,
            text=True,
            check=True
        )

        all_files = result.stdout.splitlines()
        # Keep only C++ and Java source files
        filtered_files = [file for file in all_files if file.endswith(('.cpp', '.java'))]

        if filtered_files:
            bug_file_map[bug_id]['files'].extend(filtered_files)

    except subprocess.CalledProcessError as e:
        print(f"Error searching for {bug_id}: {e}")

Already on 'master'


Your branch is up to date with 'origin/master'.
Already up to date.
Searching for bug ID: HIVE-25382 (Version: 2.3.2)
Searching for bug ID: HIVE-22165 (Version: 2.1.0)
Searching for bug ID: HIVE-21873 (Version: 2.3.4)
Searching for bug ID: HIVE-21419 (Version: 2.3.2)
Searching for bug ID: HIVE-21081 (Version: 2.3.3)
Searching for bug ID: HIVE-21009 (Version: 2.1.0)
Searching for bug ID: HIVE-20816 (Version: 2.3.2)
Searching for bug ID: HIVE-20771 (Version: 3.1.0)
Searching for bug ID: HIVE-20693 (Version: 2.3.2)
Searching for bug ID: HIVE-20659 (Version: 3.1.0)
Searching for bug ID: HIVE-20574 (Version: 2.3.2)
Searching for bug ID: HIVE-20555 (Version: 3.1.0)
Searching for bug ID: HIVE-20318 (Version: 3.1.0)
Searching for bug ID: HIVE-20210 (Version: 2.3.1)
Searching for bug ID: HIVE-20153 (Version: 2.3.2)
Searching for bug ID: HIVE-20077 (Version: 2.3.2)
Searching for bug ID: HIVE-20039 (Version: 2.3.2)
Searching for bug ID: HIVE-1963
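
One thing to keep in mind is that `git log --grep` treats its argument as a pattern searched anywhere in the commit message, so a short ID such as `HIVE-1963` could also match a commit that mentions `HIVE-19630`. A possible safeguard, sketched below, is to re-check candidate commit messages with word boundaries. The helper name `mentions_bug` and the sample messages are assumptions for illustration only.

```python
import re


def mentions_bug(message: str, bug_id: str) -> bool:
    """Return True only if bug_id appears as a whole token in message."""
    pattern = re.compile(rf"\b{re.escape(bug_id)}\b")
    return pattern.search(message) is not None


# Made-up commit subjects used only to exercise the helper.
print(mentions_bug("HIVE-1963: fix a problem", "HIVE-1963"))        # True
print(mentions_bug("HIVE-19630: unrelated change", "HIVE-1963"))    # False
```

Applying such a check would require collecting the commit messages alongside the file names, which the cell above does not currently do.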

In [8]:
os.chdir(project_repo)
output_file = 'Hive_Modified_Files.csv'

try:
    with open(output_file, 'w+', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        for bug_id, data in bug_file_map.items():
            bug_id_str = str(bug_id)
            version_str = str(data['version'])
            files_str = ";".join(data['files'])
            writer.writerow([bug_id_str, version_str, files_str])
        print(f"Output has been written to {output_file}")
except Exception as e:
    print(f"An error occurred: {e}")

Output has been written to Hive_Modified_Files.csv
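
The resulting file has no header row, and each line holds a bug ID, a version, and a `;`-separated list of files. A minimal sketch of reading it back is shown below; the function name `read_modified_files` and the in-memory sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample in the same layout: bug ID, version, ';'-joined files.
sample = io.StringIO(
    "BUG-1,2.3.2,ql/src/A.java;ql/src/B.java\n"
    "BUG-2,3.1.0,\n"
)


def read_modified_files(handle):
    """Parse rows of (bug ID, version, ';'-joined files) into a dict."""
    result = {}
    for bug_id, version, files in csv.reader(handle):
        # An empty third column means no Java or C++ file was found.
        file_list = files.split(";") if files else []
        result[bug_id] = {"version": version, "files": file_list}
    return result


parsed = read_modified_files(sample)
print(parsed["BUG-1"]["files"])  # ['ql/src/A.java', 'ql/src/B.java']
print(parsed["BUG-2"]["files"])  # []
```

The explicit check for an empty string matters because `"".split(";")` returns `['']` rather than an empty list.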


## 4. Gathering Independent Variables for Each File in Each Version of Hive

We will now use *SciTools Understand* to collect a wide range of independent variables that will eventually help us predict bugs. First, though, we need to retrieve the last commit before the release of each version of the software.

From the [Apache Hive tags](https://github.com/apache/hive/tags), we can manually look up the commit for each version of the software.

In [7]:
hive_versions = {
    "2.0.0": "7f9f1fcb8697fb33f0edc2c391930a3728d247d7",
    "2.0.1": "e3cfeebcefe9a19c5055afdcbb00646908340694",
    "2.1.0": "9265bc24d75ac945bde9ce1a0999fddd8f2aae29",
    "2.1.1": "1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c",
    "2.2.0": "da840b0f8fa99cab9f004810cd22abc207493cae",
    "2.3.0": "6f4c35c9e904d226451c465effdc5bfd31d395a0",
    "2.3.1": "7590572d9265e15286628013268b2ce785c6aa08",
    "2.3.2": "857a9fd8ad725a53bd95c1b2d6612f9b1155f44d",
    "2.3.3": "3f7dde31aed44b5440563d3f9d8a8887beccf0be",
    "2.3.4": "56acdd2120b9ce6790185c679223b8b5e884aaf2",
    "2.3.5": "76595628ae13b95162e77bba365fe4d2c60b3f29",
    "2.3.6": "2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f",
    "2.3.7": "cb213d88304034393d68cc31a95be24f5aac62b6",
    "2.3.8": "f1e87137034e4ecbe39a859d4ef44319800016d7",
    "2.3.9": "92dd0159f440ca7863be3232f3a683a510a62b9d",
    "2.3.10": "5160d3af392248255f68e41e1e0557eae4d95273",
    "3.0.0": "ce61711a5fa54ab34fc74d86d521ecaeea6b072a",
    "3.1.0": "bcc7df95824831a8d2f1524e4048dfc23ab98c19",
    "3.1.1": "f4e0529634b6231a0072295da48af466cf2f10b7",
    "3.1.2": "8190d2be7b7165effa62bd21b7d60ef81fb0e4af",
    "3.1.3": "4df4d75bf1e16fe0af75aad0b4179c34c07fc975",
    "4.0.0": "183f8cb41d3dbed961ffd27999876468ff06690c",
    "4.0.1": "3af4517eb8cfd9407ad34ed78a0b48b57dfaa264"
}
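
The hashes above were collected by hand. If the release tags are available in your local clone, the lookup could probably be scripted with `git rev-list -n 1 <tag>`, which prints the commit a tag points to. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the tag name format must be checked against `git tag --list` in your clone.

```python
import subprocess


def rev_list_command(tag: str) -> list[str]:
    """Build the git command that prints the commit a tag points to."""
    return ["git", "rev-list", "-n", "1", tag]


def resolve_tag(repo_path: str, tag: str) -> str:
    """Run the command inside repo_path and return the commit hash."""
    result = subprocess.run(
        rev_list_command(tag),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example call (not executed here; the tag name is a placeholder):
# commit = resolve_tag(hive_repo, "<tag-for-2.0.0>")
print(rev_list_command("<tag>"))
```

Comparing the scripted result against the table above would be a cheap way to catch a copy-paste error in a hash.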

These commit hashes will allow us to associate a commit with a specific version of Hive.

In [8]:
output_file = project_repo.joinpath("Hive_Last_Commits.csv")

with open(output_file, mode='w', newline='') as file:
    writer = csv.writer(file)

    writer.writerow(["Version", "Commit Hash"])

    for version, commit in hive_versions.items():
        writer.writerow([version, commit])

print("Last commits for each version saved to 'Hive_Last_Commits.csv'")

Last commits for each version saved to 'Hive_Last_Commits.csv'


Finally, we can collect the Understand (UND) data.

In [None]:
version_file = project_repo.joinpath("Hive_Last_Commits.csv")
und_base = project_repo.joinpath("UND_projects")
settings_file_path = project_repo.joinpath("settings.xml")
hive_data = project_repo.joinpath('UND_hive_data')


os.chdir(project_repo)

def run_command(command):
    try:
        subprocess.run(command, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Command failed: {e.cmd}")

run_command(f'rm -rf {und_base} && mkdir {und_base}')
run_command(f'rm -rf {hive_data} && mkdir {hive_data}')


os.chdir(hive_repo)

def process_versions():
    with open(version_file, "r") as file:
        next(file)
        for line in file:
            if line.strip():
                version, commit_id = line.split(",")[0].strip(), line.split(",")[1].strip()              

                run_command(f"cd {hive_repo} && git checkout {commit_id}")
                print(f"Successfully checked out {commit_id} before {version}")

                
                und_project_path = f"{und_base}/UND_{version}.und"

                run_command(f"und create -languages java C++ {und_project_path}")

                destination_settings_file = f"{und_project_path}/settings.xml"

                if os.path.exists(destination_settings_file):
                    print(f"Removing existing settings.xml at {destination_settings_file}")
                    os.remove(destination_settings_file)

                run_command(f"cp {settings_file_path} {und_project_path}")

                # The redundancy here is to override SciTools Understand's automatic generation of files
                run_command(f"cp {settings_file_path} {destination_settings_file}")
                run_command(f"und settings -metricsOutputFile {hive_data.joinpath(f'UND_{version}.csv')} {und_project_path}")

                run_command(f"und add {hive_repo} {und_project_path}")
                run_command(f"cp {settings_file_path} {destination_settings_file}")
                run_command(f"und settings -metricsOutputFile {hive_data.joinpath(f'UND_{version}.csv')} {und_project_path}")

                run_command(f"und analyze {und_project_path}")
                run_command(f"cp {settings_file_path} {destination_settings_file}")
                run_command(f"und settings -metricsOutputFile {hive_data.joinpath(f'UND_{version}.csv')} {und_project_path}")

                run_command(f"und metrics {und_project_path}")
                
if __name__ == "__main__":
    process_versions()

Previous HEAD position was 3af4517eb8 Preparing for 4.0.1 release
HEAD is now at 7f9f1fcb86 HIVE-13032: Hive services need HADOOP_CLIENT_OPTS for proper log4j2 initialization (Prasanth Jayachandran reviewed by Sergey Shelukhin)


Successfully checked out 7f9f1fcb8697fb33f0edc2c391930a3728d247d7 before 2.0.0


Previous HEAD position was 7f9f1fcb86 HIVE-13032: Hive services need HADOOP_CLIENT_OPTS for proper log4j2 initialization (Prasanth Jayachandran reviewed by Sergey Shelukhin)
HEAD is now at e3cfeebcef Update release notes for Hive 2.0.1 RC1


Successfully checked out e3cfeebcefe9a19c5055afdcbb00646908340694 before 2.0.1


Previous HEAD position was e3cfeebcef Update release notes for Hive 2.0.1 RC1
HEAD is now at 9265bc24d7 RELEASE_NOTES


Successfully checked out 9265bc24d75ac945bde9ce1a0999fddd8f2aae29 before 2.1.0


Previous HEAD position was 9265bc24d7 RELEASE_NOTES
HEAD is now at 1af77bbf83 Release notes 2.1.1 (addendum)


Successfully checked out 1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c before 2.1.1


Previous HEAD position was 1af77bbf83 Release notes 2.1.1 (addendum)
HEAD is now at da840b0f8f Missed file when preparing for release


Successfully checked out da840b0f8fa99cab9f004810cd22abc207493cae before 2.2.0


Previous HEAD position was da840b0f8f Missed file when preparing for release
HEAD is now at 6f4c35c9e9 Release Notes


Successfully checked out 6f4c35c9e904d226451c465effdc5bfd31d395a0 before 2.3.0


Previous HEAD position was 6f4c35c9e9 Release Notes
HEAD is now at 7590572d92 Release Notes
Previous HEAD position was 7590572d92 Release Notes
HEAD is now at 857a9fd8ad Release Notes
Previous HEAD position was 857a9fd8ad Release Notes
HEAD is now at 3f7dde31ae Preparing for 2.3.3 release


Successfully checked out 7590572d9265e15286628013268b2ce785c6aa08 before 2.3.1
Successfully checked out 857a9fd8ad725a53bd95c1b2d6612f9b1155f44d before 2.3.2
Successfully checked out 3f7dde31aed44b5440563d3f9d8a8887beccf0be before 2.3.3


Previous HEAD position was 3f7dde31ae Preparing for 2.3.3 release
HEAD is now at 56acdd2120 Preparing for 2.3.4 release
Previous HEAD position was 56acdd2120 Preparing for 2.3.4 release
HEAD is now at 76595628ae Preparing for 2.3.5 release.


Successfully checked out 56acdd2120b9ce6790185c679223b8b5e884aaf2 before 2.3.4
Successfully checked out 76595628ae13b95162e77bba365fe4d2c60b3f29 before 2.3.5


Previous HEAD position was 76595628ae Preparing for 2.3.5 release.
HEAD is now at 2c2fdd524e Updated release notes for 2.3.6
Previous HEAD position was 2c2fdd524e Updated release notes for 2.3.6
HEAD is now at cb213d8830 Updated release notes for 2.3.7 release


Successfully checked out 2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f before 2.3.6
Successfully checked out cb213d88304034393d68cc31a95be24f5aac62b6 before 2.3.7


Previous HEAD position was cb213d8830 Updated release notes for 2.3.7 release
HEAD is now at f1e8713703 Updated release notes for 2.3.8 release RC3
Previous HEAD position was f1e8713703 Updated release notes for 2.3.8 release RC3
HEAD is now at 92dd0159f4 Updated release notes for 2.3.9 release


Successfully checked out f1e87137034e4ecbe39a859d4ef44319800016d7 before 2.3.8
Successfully checked out 92dd0159f440ca7863be3232f3a683a510a62b9d before 2.3.9


Previous HEAD position was 92dd0159f4 Updated release notes for 2.3.9 release
HEAD is now at 5160d3af39 Updated release notes for 2.3.10 release RC1


Successfully checked out 5160d3af392248255f68e41e1e0557eae4d95273 before 2.3.10


Previous HEAD position was 5160d3af39 Updated release notes for 2.3.10 release RC1
HEAD is now at ce61711a5f Preparing for release 3.0.0 : Updated standalone-metastore's dependency on storage-api


Successfully checked out ce61711a5fa54ab34fc74d86d521ecaeea6b072a before 3.0.0


Previous HEAD position was ce61711a5f Preparing for release 3.0.0 : Updated standalone-metastore's dependency on storage-api
HEAD is now at bcc7df9582 HIVE-20227: Exclude glassfish javax.el dependency(Vineet Garg, reviewed by Ashutosh Chauhan)


Successfully checked out bcc7df95824831a8d2f1524e4048dfc23ab98c19 before 3.1.0


Previous HEAD position was bcc7df9582 HIVE-20227: Exclude glassfish javax.el dependency(Vineet Garg, reviewed by Ashutosh Chauhan)
HEAD is now at f4e0529634 Preparing for 3.1.1 release


Successfully checked out f4e0529634b6231a0072295da48af466cf2f10b7 before 3.1.1


Previous HEAD position was f4e0529634 Preparing for 3.1.1 release
HEAD is now at 8190d2be7b Updated Release Notes with a couple of issues that were added late to the release.


Successfully checked out 8190d2be7b7165effa62bd21b7d60ef81fb0e4af before 3.1.2


Previous HEAD position was 8190d2be7b Updated Release Notes with a couple of issues that were added late to the release.
HEAD is now at 4df4d75bf1 HIVE-25665: Checkstyle LGPL files must not be in the release sources/binaries (Peter Vary reviewed by Zoltan Haindrich) (#3063)


Successfully checked out 4df4d75bf1e16fe0af75aad0b4179c34c07fc975 before 3.1.3


Previous HEAD position was 4df4d75bf1 HIVE-25665: Checkstyle LGPL files must not be in the release sources/binaries (Peter Vary reviewed by Zoltan Haindrich) (#3063)
HEAD is now at 183f8cb41d Updating RELEASE_NOTES, NOTICE, README.md for 4.0.0


Successfully checked out 183f8cb41d3dbed961ffd27999876468ff06690c before 4.0.0
Successfully checked out 3af4517eb8cfd9407ad34ed78a0b48b57dfaa264 before 4.0.1


Previous HEAD position was 183f8cb41d Updating RELEASE_NOTES, NOTICE, README.md for 4.0.0
HEAD is now at 3af4517eb8 Preparing for 4.0.1 release
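
In the cell above, `run_command` reports a failure but the loop carries on, so the "Successfully checked out" message is printed even when `git checkout` fails. One possible variation, sketched here as an assumption rather than the notebook's actual helper, returns a boolean so the caller can decide whether to continue. It also takes an argument list instead of a shell string, which avoids quoting issues with paths.

```python
import subprocess
import sys


def run_command(args: list[str]) -> bool:
    """Run a command given as an argument list; return True on success."""
    try:
        subprocess.run(args, check=True)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Command failed: {e.cmd}")
        return False


# Demonstration with the current Python interpreter as a stand-in command.
ok = run_command([sys.executable, "-c", "pass"])
failed = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, failed)  # True False
```

In the version loop, this would allow something like `if not run_command([...]): continue` to skip a version whose checkout did not succeed.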
