# MGL869 - Lab

*MGL869 ETS Montreal - Production engineering*

## Abstract

## Authors
- **Léo FORNOFF**
- **William PHAN**
- **Yannis OUAKRIM**

---

## Part 1 : Data collection

In [1]:
from Jira import jira_download
from pandas import Index
from numpy import ndarray


### 1.1 - Download Jira data
`jira_download` fetches the issues from Jira only if they are not already present in the data folder, and returns them as a dataframe.

The query filter is defined in `config.ini`.
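
The "download only if missing" behaviour described above could look roughly like the following. This is a minimal sketch, not the actual `jira_download` implementation: `load_or_download` and its `download` callback are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable


def load_or_download(csv_path: Path, download: Callable[[Path], None]) -> str:
    """Return the CSV text at csv_path, calling download(csv_path) only if the file is missing."""
    if not csv_path.exists():
        # Create the data folder on first use, then fetch the data once.
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        download(csv_path)
    return csv_path.read_text(encoding="utf-8")
```

Calling the function a second time with the same path reads the cached file and never invokes `download`, so re-running the notebook does not hit the Jira server again.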

In [2]:
jira_dataframe = jira_download()

Downloading data from https://issues.apache.org/jira...
Filter = 'project=HIVE AND issuetype=Bug AND status in (Resolved, Closed) AND affectedVersion>= 2.0.0'
Fetching: 0 -> 999
Fetching: 1000 -> 1999
Fetching: 2000 -> 2999
Fetching: 3000 -> 3999
No more data to fetch.
All data downloaded and saved to data/jira_data/combined.csv
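
The log above suggests that the issues are fetched in pages of 1000 until nothing is left. A minimal sketch of such a loop follows, assuming a hypothetical `fetch_page(start, size)` callable that returns a list of issues; the real helper may stop on a different condition.

```python
from typing import Callable, Iterator


def fetch_all(fetch_page: Callable[[int, int], list], page_size: int = 1000) -> Iterator[dict]:
    """Yield every issue by requesting consecutive pages until an empty page comes back."""
    start = 0
    while True:
        print(f"Fetching: {start} -> {start + page_size - 1}")
        page = fetch_page(start, page_size)
        if not page:
            print("No more data to fetch.")
            break
        yield from page
        start += page_size
```

Passing the fetch function as an argument keeps the loop independent of the Jira client, which makes it easy to exercise with fake data.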


### 1.2 - Clean Jira data using pandas
In the previous step, we downloaded all the issues from Jira. We now clean them with pandas:
we keep only a few columns and merge the multi-valued version columns into single columns.

In [3]:
keep: list[str] = ['Issue key', 'Status', 'Resolution', 'Created', 'Fix Versions Combined', 'Affects Versions Combined']

In [4]:
affects_version_columns: list[str] = [col for col in jira_dataframe.columns if col.startswith('Affects Version/s')]
jira_dataframe['Affects Versions Combined'] = jira_dataframe[affects_version_columns].apply(
    lambda x: ', '.join(x.dropna().astype(str)), axis=1
)

In [5]:
# Combine the versions into a single column
fix_version_columns: list[str] = [col for col in jira_dataframe.columns if col.startswith('Fix Version/s')]

jira_dataframe['Fix Versions Combined'] = jira_dataframe[fix_version_columns].apply(
    lambda x: ', '.join(x.dropna().astype(str)), axis=1
)
jira_dataframe = jira_dataframe.loc[:, keep]
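
The effect of the two `apply` calls above is easier to see on a toy dataframe. The values below are made up for illustration. The `.1` suffix mimics how `pandas.read_csv` renames repeated headers, which is presumably why the notebook matches columns with `startswith`.

```python
import pandas as pd

# Toy stand-in for the Jira export; the values are invented for illustration.
toy = pd.DataFrame({
    "Issue key": ["HIVE-1", "HIVE-2", "HIVE-3"],
    "Affects Version/s": ["2.3.4", "4.0.0", None],
    "Affects Version/s.1": ["3.0.0", None, None],
})

version_columns = [col for col in toy.columns if col.startswith("Affects Version/s")]
# For each row, drop the missing values and join the remaining versions.
toy["Affects Versions Combined"] = toy[version_columns].apply(
    lambda row: ", ".join(row.dropna().astype(str)), axis=1
)
combined = toy["Affects Versions Combined"].tolist()
print(combined)
```

A row with no affected version at all ends up with an empty string rather than a missing value.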

In [6]:
# Identify columns whose names contain the string 'Issue key'
issue_key_columns: Index = jira_dataframe.columns[jira_dataframe.columns.str.contains('Issue key')]
# Extract the values from these columns as a NumPy array
issue_key_values: ndarray = jira_dataframe[issue_key_columns].values
# Flatten the array to create a one-dimensional list of all 'Issue key' values
flattened_issue_keys: ndarray = issue_key_values.flatten()
# Convert the list into a set to remove duplicates
ids: set = set(flattened_issue_keys)

---


## Part 2 : Repository analysis


In [7]:
from Hive import git_download, commit_analysis, update_commit_dataframe, filter_versions_by_min
from git import Repo, Tag
from pandas import DataFrame
from configparser import ConfigParser
from re import compile
from packaging import version  

### 2.1 - Clone repository

In [8]:
repo: Repo = git_download()

data/hive_data/hiveRepo True
Creating the directory: data/hive_data/hiveRepo
Cloning the repository: https://github.com/apache/hive.git


In [9]:
all_couples = commit_analysis(ids)

20524 couples found.
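
The internals of `commit_analysis` are not shown here. One plausible way to build such (issue, file, commit) couples is to look for issue keys in commit messages, as sketched below. The function `find_couples` and the shape of its `commits` argument are assumptions made for this example.

```python
import re
from typing import Iterable

ISSUE_KEY = re.compile(r"HIVE-\d+")


def find_couples(commits: Iterable[tuple], ids: set) -> list:
    """Return (issue key, file, commit sha) triples for commits whose message cites a known issue.

    Each element of commits is assumed to be (sha, message, list of changed files).
    """
    couples = []
    for sha, message, files in commits:
        # Keep only the keys that belong to the issues downloaded from Jira.
        for key in sorted(set(ISSUE_KEY.findall(message)) & ids):
            for file in files:
                couples.append((key, file, sha))
    return couples
```

The triples follow the column order used in the next section: issue key, file, commit.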


### 2.2 - Filter data

In [10]:
commit_dataframe: DataFrame = DataFrame(all_couples, columns=["Issue key", "File", "Commit"])

In [11]:
# Read the configured file suffixes and strip the whitespace around each one
config: ConfigParser = ConfigParser()
config.read("config.ini")
languages: list[str] = [lang.strip() for lang in config["GENERAL"]["Languages"].split(",")]
# Keep only the files whose path ends with one of the suffixes
commit_dataframe: DataFrame = commit_dataframe[commit_dataframe['File'].str.endswith(tuple(languages))]
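
The same filtering can be illustrated without pandas. In this sketch the `Languages` value and the file paths are hypothetical; the real values live in `config.ini` and in the repository.

```python
from configparser import ConfigParser

# Hypothetical configuration content, for illustration only.
config = ConfigParser()
config.read_string("""
[GENERAL]
Languages = .java, .cpp
""")

# Stripping matters: without it the second suffix would be " .cpp" and never match.
suffixes = tuple(s.strip() for s in config["GENERAL"]["Languages"].split(","))
files = ["ql/src/Driver.java", "README.md", "service/src/worker.cpp", "pom.xml"]
kept = [f for f in files if f.endswith(suffixes)]
print(kept)
```

`str.endswith` accepts a tuple of suffixes, which is why the notebook converts the list with `tuple(languages)`.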

In [12]:
couples = update_commit_dataframe(commit_dataframe, jira_dataframe)
couples

| | Issue key | File | Version Affected |
|---|---|---|---|
| 0 | HIVE-21614 | ql/src/test/org/apache/hadoop/hive/metastore/T... | 2.3.4, 3.0.0 |
| 1 | HIVE-21614 | standalone-metastore/metastore-server/src/main... | 2.3.4, 3.0.0 |
| 2 | HIVE-21614 | standalone-metastore/metastore-server/src/main... | 2.3.4, 3.0.0 |
| 3 | HIVE-28366 | iceberg/iceberg-handler/src/main/java/org/apac... | 4.0.0 |
| 4 | HIVE-28366 | iceberg/iceberg-handler/src/main/java/org/apac... | 4.0.0 |
| ... | ... | ... | ... |
| 10268 | HIVE-13725 | hcatalog/streaming/src/java/org/apache/hive/hc... | 1.2.1, 2.0.0 |
| 10269 | HIVE-13725 | metastore/src/java/org/apache/hadoop/hive/meta... | 1.2.1, 2.0.0 |
| 10270 | HIVE-13725 | ql/src/java/org/apache/hadoop/hive/ql/lockmgr/... | 1.2.1, 2.0.0 |
| 10271 | HIVE-13725 | ql/src/test/org/apache/hadoop/hive/ql/TestTxnC... | 1.2.1, 2.0.0 |


### 2.3 - Extract and filter versions from git

In [13]:
# Map each tag name to the commit it points to
tags: list[Tag] = repo.tags
versions: dict = {tag.name: tag.commit for tag in tags}
# Compile the comma-separated release patterns defined in config.ini
releases_regex = [compile(regex.strip()) for regex in config["GIT"]["ReleasesRegex"].split(",")]

In [14]:
filtered_versions = filter_versions_by_min(versions, releases_regex, "1.0")
filtered_versions

{'4.0.1': <git.Commit "3af4517eb8cfd9407ad34ed78a0b48b57dfaa264">,
 '2.3.10': <git.Commit "5160d3af392248255f68e41e1e0557eae4d95273">,
 '4.0.0': <git.Commit "183f8cb41d3dbed961ffd27999876468ff06690c">,
 '3.1.3': <git.Commit "4df4d75bf1e16fe0af75aad0b4179c34c07fc975">,
 '2.3.9': <git.Commit "92dd0159f440ca7863be3232f3a683a510a62b9d">,
 '2.3.8': <git.Commit "f1e87137034e4ecbe39a859d4ef44319800016d7">,
 '2.3.7': <git.Commit "cb213d88304034393d68cc31a95be24f5aac62b6">,
 '3.1.2': <git.Commit "8190d2be7b7165effa62bd21b7d60ef81fb0e4af">,
 '2.3.6': <git.Commit "2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f">,
 '2.3.5': <git.Commit "76595628ae13b95162e77bba365fe4d2c60b3f29">,
 '2.3.4': <git.Commit "56acdd2120b9ce6790185c679223b8b5e884aaf2">,
 '3.1.1': <git.Commit "f4e0529634b6231a0072295da48af466cf2f10b7">,
 '3.1.0': <git.Commit "bcc7df95824831a8d2f1524e4048dfc23ab98c19">,
 '3.0.0': <git.Commit "ce61711a5fa54ab34fc74d86d521ecaeea6b072a">,
 '2.3.3': <git.Commit "3f7dde31aed44b5440563d3f9d8a8887beccf0
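
The filtering step can be sketched as follows. This is not the code of `filter_versions_by_min`: the helper names, the tag names and the pattern are assumptions. The notebook imports `packaging.version`, which the real helper probably relies on; the sketch compares tuples of integers instead so that it only needs the standard library.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn '2.3.10' into (2, 3, 10) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def filter_by_min(tags: dict, patterns: list, minimum: str) -> dict:
    """Keep tags that fully match one pattern and whose captured version is >= minimum."""
    kept = {}
    for name, commit in tags.items():
        for pattern in patterns:
            match = pattern.fullmatch(name)
            if match and parse_version(match.group(1)) >= parse_version(minimum):
                kept[match.group(1)] = commit
                break
    return kept


# Hypothetical tag names and pattern, for illustration only.
patterns = [re.compile(r"(?:rel/)?release-(\d+\.\d+\.\d+)")]
tags = {
    "rel/release-2.3.10": "a",
    "rel/release-2.3.9": "b",
    "release-0.13.0": "c",
    "rel/release-3.0.0-rc1": "d",
}
print(filter_by_min(tags, patterns, "1.0"))
```

Comparing parsed versions rather than raw strings matters here: as strings, `'2.3.10'` would sort before `'2.3.9'`.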

## Part 3 : Understand analysis

In [15]:
from Understand.commands import und_create_command, und_analyze_command, und_metrics_command, und_purge_command
from Understand.metrics import metrics
from Understand.label import label_all_metrics
from os import path

### 3.1 - Create the Understand project


In [16]:
hive_git_directory: str = config["GIT"]["HiveGitDirectory"]
data_directory: str = config["GENERAL"]["DataDirectory"]
understand_project_name : str = config["UNDERSTAND"]["UnderstandProjectName"]

understand_project_path : str = path.join(data_directory, hive_git_directory, understand_project_name)

if not path.exists(understand_project_path):
    und_create_command()

Running command : 
     /Applications/Understand.app/Contents/MacOS/und create -db data/hive_data/hive.und -languages Java c++



In [17]:
und_purge_command()

Running command : 
     /Applications/Understand.app/Contents/MacOS/und purge -db data/hive_data/hive.und
Database purged.
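
The two commands printed above could be assembled and launched along these lines. This is a sketch, not the code of `und_create_command` or `und_purge_command`: the helper names are hypothetical, and the executable path and arguments are copied from the outputs above.

```python
import subprocess

# Path copied from the notebook output; it is specific to a macOS install.
UND = "/Applications/Understand.app/Contents/MacOS/und"


def und_command(action: str, db: str, *extra: str) -> list:
    """Build the argument list for an `und` invocation without running it."""
    return [UND, action, "-db", db, *extra]


def run(command: list) -> None:
    """Print and execute a command, raising CalledProcessError on a non-zero exit code."""
    print("Running command :\n    ", " ".join(command))
    subprocess.run(command, check=True)


create = und_command("create", "data/hive_data/hive.und", "-languages", "Java", "c++")
purge = und_command("purge", "data/hive_data/hive.und")
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failing `und` call stop the notebook instead of passing silently.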



### 3.2 - Metrics extraction


In [18]:
metrics(filtered_versions)

Metrics analysis is skipped as per configuration.


### 3.3 - Labeling


In [19]:
label_all_metrics(couples)

Labelization process is skipped as per configuration.


## Part 4 : Dynamic metrics

In [20]:
from Dynamic import collect_dynamic_metrics
from Hive import sort_filtered_versions

In [21]:
versions_since_1_2 = filter_versions_by_min(versions, releases_regex, "1.2.2")
ordered_versions = sort_filtered_versions(versions_since_1_2)
ordered_versions

{'1.2.2': <git.Commit "395368fc6478c7e2a1e84a5a2a8aac45e4399a9e">,
 '2.0.0': <git.Commit "7f9f1fcb8697fb33f0edc2c391930a3728d247d7">,
 '2.0.1': <git.Commit "e3cfeebcefe9a19c5055afdcbb00646908340694">,
 '2.1.0': <git.Commit "9265bc24d75ac945bde9ce1a0999fddd8f2aae29">,
 '2.1.1': <git.Commit "1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c">,
 '2.2.0': <git.Commit "da840b0f8fa99cab9f004810cd22abc207493cae">,
 '2.3.0': <git.Commit "6f4c35c9e904d226451c465effdc5bfd31d395a0">,
 '2.3.1': <git.Commit "7590572d9265e15286628013268b2ce785c6aa08">,
 '2.3.2': <git.Commit "857a9fd8ad725a53bd95c1b2d6612f9b1155f44d">,
 '2.3.3': <git.Commit "3f7dde31aed44b5440563d3f9d8a8887beccf0be">,
 '2.3.4': <git.Commit "56acdd2120b9ce6790185c679223b8b5e884aaf2">,
 '2.3.5': <git.Commit "76595628ae13b95162e77bba365fe4d2c60b3f29">,
 '2.3.6': <git.Commit "2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f">,
 '2.3.7': <git.Commit "cb213d88304034393d68cc31a95be24f5aac62b6">,
 '2.3.8': <git.Commit "f1e87137034e4ecbe39a859d4ef44319800016d
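
The ordering shown above is numeric rather than alphabetical. A minimal sketch of such a sort follows, with hypothetical helper names and placeholder commit values; the real `sort_filtered_versions` may be implemented differently.

```python
def version_key(name: str) -> tuple:
    """Turn '2.3.10' into (2, 3, 10) for numeric comparison."""
    return tuple(int(part) for part in name.split("."))


def sort_versions(versions: dict) -> dict:
    """Return a new dict whose keys are ordered from the oldest to the newest version."""
    return dict(sorted(versions.items(), key=lambda item: version_key(item[0])))


ordered = sort_versions({"2.3.10": "a", "2.0.0": "b", "2.3.9": "c", "1.2.2": "d"})
print(list(ordered))
```

Because dictionaries preserve insertion order, the result can be iterated pairwise to walk from one release to the next.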

In [22]:
dynamic_metrics = collect_dynamic_metrics(ordered_versions, repo,"2.0.0", 1)

Processing commits between 2.0.0 (7f9f1fcb8697fb33f0edc2c391930a3728d247d7) and 2.0.1 (e3cfeebcefe9a19c5055afdcbb00646908340694)


In [23]:
dynamic_metrics

{'2.0.1': {'count_lines': {'MiniHS2.java': {'LinesAdded': 49,
    'LinesDeleted': 29},
   'TestJdbcWithLocalClusterSpark.java': {'LinesAdded': 1, 'LinesDeleted': 1},
   'TestJdbcWithMiniMr.java': {'LinesAdded': 1, 'LinesDeleted': 1},
   'TestMultiSessionsHS2WithLocalClusterSpark.java': {'LinesAdded': 3,
    'LinesDeleted': 4},
   'TestSSL.java': {'LinesAdded': 61, 'LinesDeleted': 10},
   'TestHS2AuthzContext.java': {'LinesAdded': 2, 'LinesDeleted': 2},
   'TestJdbcMetadataApiAuth.java': {'LinesAdded': 1, 'LinesDeleted': 1},
   'TestJdbcWithSQLAuthorization.java': {'LinesAdded': 1, 'LinesDeleted': 1},
   'HiveConnection.java': {'LinesAdded': 45, 'LinesDeleted': 54},
   'HiveAuthFactory.java': {'LinesAdded': 104, 'LinesDeleted': 57},
   'DatabaseConnection.java': {'LinesAdded': 8, 'LinesDeleted': 4},
   'HiveConf.java': {'LinesAdded': 6, 'LinesDeleted': 3},
   'LowLevelLrfuCachePolicy.java': {'LinesAdded': 6, 'LinesDeleted': 3},
   'OrcMetadataCache.java': {'LinesAdded': 24, 'LinesDelete
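
Per-file counts with the shape shown above can be derived from the text produced by `git diff --numstat <old> <new>`, which prints one `added<TAB>deleted<TAB>path` line per file and `-` for binary files. The parser below is a sketch under that assumption. It is not the code of `collect_dynamic_metrics`, which appears to process the commits between two releases, and the sample paths are invented.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def parse_numstat(numstat: str) -> dict:
    """Sum added and deleted lines per Java file name from `git diff --numstat` text."""
    totals = defaultdict(lambda: {"LinesAdded": 0, "LinesDeleted": 0})
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, path = parts
        if added == "-" or not path.endswith(".java"):
            continue  # "-" marks binary files; other languages are ignored here
        name = PurePosixPath(path).name
        totals[name]["LinesAdded"] += int(added)
        totals[name]["LinesDeleted"] += int(deleted)
    return dict(totals)


sample = (
    "49\t29\titests/src/MiniHS2.java\n"
    "-\t-\tdocs/logo.png\n"
    "6\t3\tcommon/src/HiveConf.java\n"
    "1\t0\tother/HiveConf.java\n"
)
print(parse_numstat(sample))
```

Keying on the bare file name mirrors the output above, but it also merges files that share a name across directories, as the two `HiveConf.java` entries in the sample show.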

In [24]:
import pickle

# Save the dynamic_metrics object to a file
with open("dynamic_metrics.pkl", "wb") as file:
    pickle.dump(dynamic_metrics, file)
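
Reading the metrics back in a later session is the mirror operation. The round trip below uses a temporary folder and a small excerpt of the dictionary shown earlier, so that it does not touch the real `dynamic_metrics.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

# Small excerpt of the structure printed above.
metrics = {"2.0.1": {"count_lines": {"HiveConf.java": {"LinesAdded": 6, "LinesDeleted": 3}}}}

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "dynamic_metrics.pkl"
    with open(target, "wb") as file:
        pickle.dump(metrics, file)
    # Files must be opened in binary mode for both writing and reading.
    with open(target, "rb") as file:
        restored = pickle.load(file)

print(restored == metrics)
```

`pickle.load` can execute arbitrary code, so it should only be used on files produced by a trusted run of this notebook.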