# Data Extraction

This notebook contains all the steps necessary to fetch all bugs from the Apache Hive Jira online repository (available at [this address](https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues)). We will begin by downloading the data to this repository before filtering it. 

## 1. Fetching Bug Reports from Jira

The first step is to download all the bug reports for *Hive 2.0.0* and subsequent versions. [On this page](https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues), we can select `Advanced Search` and enter the following JQL query:
```sql
project = HIVE AND issuetype = Bug AND status in (Resolved, Closed) AND affectedVersion = X.Y.Z
```
to fetch the bugs that affect a specific version, with `X.Y.Z` replaced by the version number. Bug reports for major and minor versions, as well as patches, can be downloaded. All of the bug reports are kept in the `Jira_Bug_Data` folder of this repository.
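
As a minimal sketch, the query above can be generated for each version instead of being edited by hand. The helper name `build_jql` is hypothetical and the two versions are only examples; the query text itself is the one shown above.

```python
def build_jql(version: str) -> str:
    """Return the Advanced Search query for resolved or closed Hive bugs in one version."""
    return (
        "project = HIVE AND issuetype = Bug "
        "AND status in (Resolved, Closed) "
        f"AND affectedVersion = {version}"
    )


# Example versions, chosen for illustration only.
for version in ["2.0.0", "3.3.0"]:
    print(build_jql(version))
```

Each printed line can be pasted into the `Advanced Search` box before exporting the results.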

## 2. Removing Redundant Bugs & Concatenating the Data 
Since a given bug may affect more than one version of the software, the downloaded data contains some redundancy. However, we may not want to remove these cross-version duplicates, because we will later look up the files affected by a given bug in each version of the project. We therefore load every bug report file into a pandas data frame, tag each row with the version number of its file, and concatenate all the bugs into a single file.
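
To illustrate the idea on toy data, the sketch below tags two small frames with a version, concatenates them, and drops only exact repeats. The `Issue key` column name, the issue keys, and the version `2.1.0` are assumptions made for this example and are not taken from the downloaded files.

```python
import pandas as pd

# Toy rows standing in for two version files; 'Issue key' is an assumed column name.
v1 = pd.DataFrame({"Issue key": ["HIVE-1", "HIVE-2", "HIVE-2"], "Version": "2.0.0"})
v2 = pd.DataFrame({"Issue key": ["HIVE-2"], "Version": "2.1.0"})

combined = pd.concat([v1, v2], ignore_index=True)
deduplicated = combined.drop_duplicates(ignore_index=True)

# HIVE-2 is kept once per version; only the exact repeat within 2.0.0 is dropped.
print(deduplicated)
```

Because the version is part of each row, a bug that affects several versions survives once per version, which is the behaviour described above.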

In [91]:
import pandas as pd
import os
import re
import glob

bug_dfs = []


for file in glob.glob("Jira_Bug_Data/*.csv"):
    df = pd.read_csv(file)
    df = df.reset_index(drop=True)
    filename = os.path.basename(file)  # e.g., 'Hive_3.3.0_Jira_Bug_Data.csv'
    version_match = re.search(r'_(\d+\.\d+\.\d+)_', filename)
    if version_match:
        version = version_match.group(1)  # e.g., '3.3.0'
    else:
        version = 'Unknown'
    df['Version'] = version
    df = df[['Issue Type', 'Version']]
    bug_dfs.append(df)

concatenated_bug_dfs = pd.concat(bug_dfs, ignore_index=True)
concatenated_bug_dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2593 entries, 0 to 2592
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Issue Type  2593 non-null   object
 1   Version     2593 non-null   object
dtypes: object(2)
memory usage: 40.6+ KB
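
The version tag in the cell above depends entirely on the file name matching the pattern `_X.Y.Z_`. The sketch below isolates that step so it can be checked on its own; the helper name `extract_version` and the second and third file names are hypothetical.

```python
import re

# Same pattern as in the loading loop: three dot-separated numbers between underscores.
VERSION_PATTERN = re.compile(r"_(\d+\.\d+\.\d+)_")


def extract_version(filename: str) -> str:
    """Return the X.Y.Z version embedded in a file name, or 'Unknown' if none is found."""
    match = VERSION_PATTERN.search(filename)
    return match.group(1) if match else "Unknown"


print(extract_version("Hive_3.3.0_Jira_Bug_Data.csv"))          # 3.3.0
print(extract_version("Hive_4.0.0-alpha-1_Jira_Bug_Data.csv"))  # Unknown (suffix breaks the pattern)
print(extract_version("notes.csv"))                             # Unknown
```

Any row whose `Version` ends up as `Unknown` therefore points to a file name that does not follow the expected naming scheme.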


In [92]:
# Drop rows that are exact duplicates (same values in every column, including 'Version').
combined_bug_dfs = concatenated_bug_dfs.drop_duplicates(ignore_index=True)
combined_bug_dfs.info()

In [None]:
combined_bug_dfs.to_csv("Hive_Jira_Bug_Data.csv")

## 2.