# MGL869 - Lab

*MGL869 ETS Montreal - Production engineering*

## Abstract

## Authors
- **Léo FORNOFF**
- **William PHAN**
- **Yannis OUAKRIM**

---

## Part 1 : Data collection

In [1]:
from Jira import jira_download
from pandas import Index
from numpy import ndarray


### 1.1 - Download Jira data
The data are downloaded only if they are not already present in the data folder.

The function returns the data as a dataframe.

The query filter can be defined in config.ini.

In [2]:
jira_dataframe = jira_download()

Data already downloaded
Filter = 'project=HIVE AND issuetype=Bug AND status in (Resolved, Closed) AND affectedVersion>= 2.0.0'
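
The body of jira_download is not shown in this notebook. The download-once behaviour described above could be sketched as follows. The `[JIRA]` section, the `Filter` key, `load_or_download`, and `fake_download` are hypothetical names chosen for illustration, not the project's actual configuration or code.

```python
from configparser import ConfigParser
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable

# Hypothetical config.ini content; the real section and key names may differ.
EXAMPLE_CONFIG = """
[JIRA]
Filter = project=HIVE AND issuetype=Bug AND status in (Resolved, Closed)
"""


def load_or_download(cache_file: Path, download: Callable[[], str]) -> str:
    """Return the cached text if it exists; otherwise download and cache it."""
    if cache_file.exists():
        print("Data already downloaded")
        return cache_file.read_text(encoding="utf-8")
    content = download()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(content, encoding="utf-8")
    return content


example_config = ConfigParser()
example_config.read_string(EXAMPLE_CONFIG)
jira_filter = example_config["JIRA"]["Filter"]

calls = []


def fake_download() -> str:
    """Stand-in for a real Jira export; records each call it receives."""
    calls.append(jira_filter)
    return "Issue key,Status\nHIVE-1,Closed\n"


with TemporaryDirectory() as tmp:
    cache = Path(tmp) / "jira.csv"
    first = load_or_download(cache, fake_download)   # triggers the download
    second = load_or_download(cache, fake_download)  # served from the cache
```

Passing the download step as a callable keeps the caching logic testable without network access.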


### 1.2 - Clean Jira data using pandas
Previously, we downloaded all the data from Jira. Now we clean the data using pandas.
We keep only some columns and combine the multi-valued version columns into single columns.

In [3]:
# Columns kept after cleaning (the two 'Combined' columns are created below)
keep: list[str] = ['Issue key', 'Status', 'Resolution', 'Created', 'Fix Versions Combined', 'Affects Versions Combined']

In [4]:
# Combine the affected versions into a single column
affects_version_columns: list[str] = [col for col in jira_dataframe.columns if col.startswith('Affects Version/s')]
jira_dataframe['Affects Versions Combined'] = jira_dataframe[affects_version_columns].apply(
    lambda x: ', '.join(x.dropna().astype(str)), axis=1
)

In [5]:
# Combine the versions into a single column
fix_version_columns: list[str] = [col for col in jira_dataframe.columns if col.startswith('Fix Version/s')]

jira_dataframe['Fix Versions Combined'] = jira_dataframe[fix_version_columns].apply(
    lambda x: ', '.join(x.dropna().astype(str)), axis=1
)
jira_dataframe = jira_dataframe.loc[:, keep]
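
The `startswith` test is needed because a multi-valued field arrives as several columns: when a CSV header repeats a name, pandas `read_csv` renames the duplicates with `.1`, `.2`, ... suffixes. Below is a minimal sketch of the combining step on a toy frame. The rows are made up, and the assumption that the Jira export repeats the header once per value is ours.

```python
import pandas as pd

# Toy frame with made-up rows; the real export has many more columns.
toy = pd.DataFrame(
    {
        "Issue key": ["HIVE-1", "HIVE-2", "HIVE-3"],
        "Affects Version/s": ["2.3.4", "4.0.0", None],
        "Affects Version/s.1": ["3.0.0", None, None],
    }
)

version_columns = [col for col in toy.columns if col.startswith("Affects Version/s")]

# For each row: drop the missing values, then join the remaining ones.
toy["Affects Versions Combined"] = toy[version_columns].apply(
    lambda row: ", ".join(row.dropna().astype(str)), axis=1
)

combined = toy["Affects Versions Combined"].tolist()
```

A row with no version at all ends up with an empty string rather than a missing value, which matters for any later `dropna` or `isna` check.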

In [6]:
# Identify columns whose names contain the string 'Issue key'
issue_key_columns: Index = jira_dataframe.columns[jira_dataframe.columns.str.contains('Issue key')]
# Extract the values from these columns as a NumPy array
issue_key_values: ndarray = jira_dataframe[issue_key_columns].values
# Flatten the array to create a one-dimensional list of all 'Issue key' values
flattened_issue_keys: ndarray = issue_key_values.flatten()
# Convert the list into a set to remove duplicates
ids: set = set(flattened_issue_keys)

---


## Part 2 : Repository analysis


In [7]:
from Hive import git_download, commit_analysis, update_commit_dataframe, filter_versions_by_min
from git import Repo, Tag
from pandas import DataFrame
from configparser import ConfigParser
from re import compile
from packaging import version

### 2.1 - Clone repository

In [8]:
repo: Repo = git_download()

Output/hive_data/hiveRepo False
Pulling the repository: https://github.com/apache/hive.git


In [9]:
all_couples = commit_analysis(ids)

20524 couples found.
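
The implementation of commit_analysis is not shown here. One plausible way to obtain (issue key, file, commit) couples is to look for issue keys in commit messages, sketched below on toy data. `ToyCommit`, `find_couples`, and the matching rule are assumptions for illustration and may differ from what the Hive module actually does.

```python
import re
from dataclasses import dataclass

ISSUE_KEY = re.compile(r"HIVE-\d+")


@dataclass
class ToyCommit:
    """Hypothetical stand-in for a Git commit (hash, message, touched files)."""
    sha: str
    message: str
    files: list[str]


def find_couples(commits: list[ToyCommit], known_ids: set[str]) -> list[tuple[str, str, str]]:
    """Return (issue key, file, commit) for commits whose message cites a known key."""
    couples = []
    for commit in commits:
        for key in sorted(set(ISSUE_KEY.findall(commit.message))):
            if key in known_ids:
                couples.extend((key, file, commit.sha) for file in commit.files)
    return couples


toy_commits = [
    ToyCommit("a1", "HIVE-10: fix null check", ["ql/A.java", "ql/B.java"]),
    ToyCommit("b2", "HIVE-99: unrelated issue", ["README.md"]),
    ToyCommit("c3", "Update docs", ["docs/index.md"]),
]
toy_couples = find_couples(toy_commits, known_ids={"HIVE-10"})
```

Only the first commit contributes couples: the second cites a key that is not in the Jira set, and the third cites no key at all.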


### 2.2 - Filter data

In [10]:
commit_dataframe: DataFrame = DataFrame(all_couples, columns=["Issue key", "File", "Commit"])

In [11]:
# Keep only the files whose name ends with one of the configured language suffixes
config: ConfigParser = ConfigParser()
config.read("config.ini")
languages: list[str] = [lang.strip() for lang in config["GENERAL"]["Languages"].split(",")]
commit_dataframe: DataFrame = commit_dataframe[commit_dataframe['File'].str.endswith(tuple(languages))]
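
The cell above relies on `endswith` accepting a tuple of suffixes, and on stripping the whitespace left over from the comma split. A small standalone sketch follows; the `Languages` value and the file names are made up, since the real config.ini is not shown.

```python
from configparser import ConfigParser

# Hypothetical config.ini fragment; the actual values are not shown in the notebook.
example_config = ConfigParser()
example_config.read_string("[GENERAL]\nLanguages = .java, .cpp ,.h\n")

# Without strip(), " .cpp " would keep its spaces and never match a file name.
suffixes = tuple(s.strip() for s in example_config["GENERAL"]["Languages"].split(","))

files = ["ql/Driver.java", "README.md", "serde/types.h", "build.xml"]
kept = [name for name in files if name.endswith(suffixes)]
```

The pandas accessor `Series.str.endswith` accepts the same tuple form, which is why the notebook converts the list with `tuple(languages)`.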

In [12]:
couples = update_commit_dataframe(commit_dataframe, jira_dataframe)
couples

| | Issue key | File | Version Affected |
|---|---|---|---|
| 0 | HIVE-21614 | ql/src/test/org/apache/hadoop/hive/metastore/T... | 2.3.4, 3.0.0 |
| 1 | HIVE-21614 | standalone-metastore/metastore-server/src/main... | 2.3.4, 3.0.0 |
| 2 | HIVE-21614 | standalone-metastore/metastore-server/src/main... | 2.3.4, 3.0.0 |
| 3 | HIVE-28366 | iceberg/iceberg-handler/src/main/java/org/apac... | 4.0.0 |
| 4 | HIVE-28366 | iceberg/iceberg-handler/src/main/java/org/apac... | 4.0.0 |
| ... | ... | ... | ... |
| 10268 | HIVE-17706 | itests/util/src/main/java/org/apache/hadoop/hi... | 3.0.0 |
| 10269 | HIVE-17706 | itests/util/src/main/java/org/apache/hive/beel... | 3.0.0 |
| 10270 | HIVE-17706 | itests/util/src/main/java/org/apache/hive/beel... | 3.0.0 |
| 10271 | HIVE-17633 | itests/util/src/main/java/org/apache/hadoop/hi... | 3.0.0 |


### 2.3 - Extract and filter versions from git

In [13]:
# Map every tag name to the commit it points to
tags: list[Tag] = repo.tags
versions: dict = {tag.name: tag.commit for tag in tags}
# Compile the comma-separated release patterns read from config.ini
releases_regex = [compile(regex.strip()) for regex in config["GIT"]["ReleasesRegex"].split(",")]

In [14]:
filtered_versions = filter_versions_by_min(versions, releases_regex, "2.0.0")
filtered_versions

{'4.0.1': <git.Commit "3af4517eb8cfd9407ad34ed78a0b48b57dfaa264">,
 '2.3.10': <git.Commit "5160d3af392248255f68e41e1e0557eae4d95273">,
 '4.0.0': <git.Commit "183f8cb41d3dbed961ffd27999876468ff06690c">,
 '3.1.3': <git.Commit "4df4d75bf1e16fe0af75aad0b4179c34c07fc975">,
 '2.3.9': <git.Commit "92dd0159f440ca7863be3232f3a683a510a62b9d">,
 '2.3.8': <git.Commit "f1e87137034e4ecbe39a859d4ef44319800016d7">,
 '2.3.7': <git.Commit "cb213d88304034393d68cc31a95be24f5aac62b6">,
 '3.1.2': <git.Commit "8190d2be7b7165effa62bd21b7d60ef81fb0e4af">,
 '2.3.6': <git.Commit "2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f">,
 '2.3.5': <git.Commit "76595628ae13b95162e77bba365fe4d2c60b3f29">,
 '2.3.4': <git.Commit "56acdd2120b9ce6790185c679223b8b5e884aaf2">,
 '3.1.1': <git.Commit "f4e0529634b6231a0072295da48af466cf2f10b7">,
 '3.1.0': <git.Commit "bcc7df95824831a8d2f1524e4048dfc23ab98c19">,
 '3.0.0': <git.Commit "ce61711a5fa54ab34fc74d86d521ecaeea6b072a">,
 '2.3.3': <git.Commit "3f7dde31aed44b5440563d3f9d8a8887beccf0
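
The body of filter_versions_by_min is not shown. Judging from its inputs and output, it keeps the tags that match a release pattern and whose version is at least the given minimum. A minimal sketch under that assumption follows; the tag names, the pattern, the placeholder hashes, and the helper names are hypothetical.

```python
import re


def as_tuple(version: str) -> tuple[int, ...]:
    """Turn '2.3.10' into (2, 3, 10) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def filter_by_min(tags: dict[str, str], patterns: list[re.Pattern], minimum: str) -> dict[str, str]:
    """Keep tags matching a release pattern whose version is >= minimum."""
    kept = {}
    for name, commit in tags.items():
        for pattern in patterns:
            match = pattern.fullmatch(name)
            if match and as_tuple(match.group(1)) >= as_tuple(minimum):
                kept[match.group(1)] = commit
                break
    return kept


# Hypothetical tag names and placeholder hashes, for illustration only.
toy_tags = {
    "rel/release-2.3.10": "5160d3a",
    "rel/release-1.2.2": "395368f",
    "rel/release-3.0.0-rc1": "0000000",
    "rel/storage-release-2.7.0": "1111111",
}
toy_patterns = [re.compile(r"rel/release-(\d+\.\d+\.\d+)")]
toy_result = filter_by_min(toy_tags, toy_patterns, "2.0.0")
```

Using `fullmatch` rejects release candidates and other tag families. The tuple comparison is a simplification: `Version` from packaging, used in the next cell, also handles pre-releases and treats `2.0` and `2.0.0` as equal.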

In [15]:
from packaging.version import Version

sorted_versions = dict(
    sorted(filtered_versions.items(), key=lambda item: Version(item[0]), reverse=True)
)

sorted_versions

{'4.0.1': <git.Commit "3af4517eb8cfd9407ad34ed78a0b48b57dfaa264">,
 '4.0.0': <git.Commit "183f8cb41d3dbed961ffd27999876468ff06690c">,
 '3.1.3': <git.Commit "4df4d75bf1e16fe0af75aad0b4179c34c07fc975">,
 '3.1.2': <git.Commit "8190d2be7b7165effa62bd21b7d60ef81fb0e4af">,
 '3.1.1': <git.Commit "f4e0529634b6231a0072295da48af466cf2f10b7">,
 '3.1.0': <git.Commit "bcc7df95824831a8d2f1524e4048dfc23ab98c19">,
 '3.0.0': <git.Commit "ce61711a5fa54ab34fc74d86d521ecaeea6b072a">,
 '2.3.10': <git.Commit "5160d3af392248255f68e41e1e0557eae4d95273">,
 '2.3.9': <git.Commit "92dd0159f440ca7863be3232f3a683a510a62b9d">,
 '2.3.8': <git.Commit "f1e87137034e4ecbe39a859d4ef44319800016d7">,
 '2.3.7': <git.Commit "cb213d88304034393d68cc31a95be24f5aac62b6">,
 '2.3.6': <git.Commit "2c2fdd524e8783f6e1f3ef15281cc2d5ed08728f">,
 '2.3.5': <git.Commit "76595628ae13b95162e77bba365fe4d2c60b3f29">,
 '2.3.4': <git.Commit "56acdd2120b9ce6790185c679223b8b5e884aaf2">,
 '2.3.3': <git.Commit "3f7dde31aed44b5440563d3f9d8a8887beccf0
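
The sort above uses `Version` as its key instead of the raw strings because string comparison goes character by character. The sketch below shows the difference on three version strings, using a tuple of integers as a standard-library stand-in for `Version`.

```python
sample = ["2.3.9", "2.3.10", "3.0.0"]

# Plain string sort: "2.3.9" ranks above "2.3.10" because "9" > "1".
as_strings = sorted(sample, reverse=True)

# Numeric key: compare the components as integers instead.
as_numbers = sorted(
    sample,
    key=lambda v: tuple(int(part) for part in v.split(".")),
    reverse=True,
)
```

With the string sort, `2.3.10` would land after `2.3.9` and the release order would be wrong for every series that reaches a two-digit patch number.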

## Part 3 : Understand analysis

In [16]:
from Understand.commands import und_create_command, und_purge_command
from Understand.metrics import metrics
from Understand.label import label_all_metrics
from os import path
from Understand import merge_static_metrics
from Understand.enrich import enrich_metrics
from Understand.update import merge_all_metrics

### 3.1 - Create the Understand project


In [17]:
hive_git_directory: str = config["GIT"]["HiveGitDirectory"]
data_directory: str = config["GENERAL"]["DataDirectory"]
understand_project_name: str = config["UNDERSTAND"]["UnderstandProjectName"]

understand_project_path: str = path.join(data_directory, hive_git_directory, understand_project_name)

if not path.exists(understand_project_path):
    und_create_command()

In [18]:
und_purge_command()

Running command : 
     /Applications/Understand.app/Contents/MacOS/und purge -db Output/hive_data/hive.und
Database purged.



### 3.2 - Metrics extraction


In [19]:
metrics(filtered_versions)

Metrics analysis is skipped as per configuration.


### 3.3 - Labeling


In [20]:
label_all_metrics(couples)

Labelization process is skipped as per configuration.


In [21]:
enrich_metrics(couples)

Enrichment process is skipped as per configuration.


In [22]:
v = [
    "2.0.0", "2.0.1", "2.1.0", "2.1.1", "2.2.0", "2.3.0", "2.3.1", "2.3.2",
    "2.3.3", "2.3.4", "2.3.5", "2.3.6", "2.3.7", "2.3.8", "2.3.9", "2.3.10",
    "3.0.0", "3.1.0", "3.1.1", "3.1.2", "3.1.3", "4.0.0", "4.0.1"
]
merge_all_metrics(v)

Merging has already been done. Skipping...


In [23]:
merge_static_metrics()

Merging has already been done. Skipping...


In [24]:
from AI import run_pipeline
import os
from configparser import ConfigParser
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [25]:
config: ConfigParser = ConfigParser()
config.read("config.ini")

['config.ini']

In [26]:
data_directory = config["GENERAL"]["DataDirectory"]
output_dir = config["UNDERSTAND"]["FullStaticMetricsOutputDirectory"]
file_name = config["UNDERSTAND"]["MergedStaticMetricsFileName"]
file_path = os.path.join(data_directory, output_dir, file_name)

In [27]:
# Logistic Regression with balanced class weights
# metrics_logistic_balanced = run_pipeline(
#     file_path=file_path,
#     model=lambda: LogisticRegression(max_iter=1000, class_weight="balanced"),
#     config_section="VERSION_ALL_LAB"
# )

# Run pipeline for all_versions with Random Forest
# metrics_rf_balanced = run_pipeline(
#     file_path=file_path,
#     model=lambda: RandomForestClassifier(random_state=42, class_weight='balanced'),
#     config_section="VERSION_ALL_LAB"
# )
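
Both models above are configured with `class_weight="balanced"`. scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that arithmetic with the standard library on a made-up label distribution; `balanced_weights` is a helper written for this example.

```python
from collections import Counter


def balanced_weights(labels: list[int]) -> dict[int, float]:
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up distribution: 90 files without a bug (0), 10 files with a bug (1).
toy_labels = [0] * 90 + [1] * 10
toy_weights = balanced_weights(toy_labels)
```

Here the minority class gets a weight of 5.0 and the majority class about 0.56, so an error on a buggy file costs roughly nine times more during training.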

In [28]:
# metrics_logistic_balanced

In [29]:
# metrics_rf_balanced

## Part 5 : Dynamic Metrics

In [30]:
from Dynamic import convert_json_to_csv, merge_static_and_dynamic_csv, build_dependencies, display_hierarchy, collect_dynamic_metrics_v2
from Hive import filter_versions_by_min

In [31]:
all_versions = filter_versions_by_min(versions, releases_regex, '1.0')
version_json = build_dependencies(all_versions)
display_hierarchy(version_json)

version_json

1.0.0 (2015-02-03) [Commit: 697aeca]
    1.0.1 (2015-05-14) [Commit: 73b600d]
    1.1.0 (2015-03-09) [Commit: e60744d]
        1.1.1 (2015-05-14) [Commit: 3e8d832]
        1.2.0 (2015-05-14) [Commit: 7f237de]
            1.2.1 (2015-06-19) [Commit: 243e7c1]
                1.2.2 (2017-04-01) [Commit: 395368f]
                2.0.0 (2016-02-09) [Commit: 7f9f1fc]
                    2.0.1 (2016-05-03) [Commit: e3cfeeb]
                        2.1.0 (2016-06-17) [Commit: 9265bc2]
                            2.1.1 (2016-11-29) [Commit: 1af77bb]
                                2.2.0 (2017-07-21) [Commit: da840b0]
                                2.3.0 (2017-07-13) [Commit: 6f4c35c]
                                    2.3.1 (2017-10-19) [Commit: 7590572]
                                        2.3.2 (2017-11-09) [Commit: 857a9fd]
                                            2.3.3 (2018-03-28) [Commit: 3f7dde3]
                                                2.3.4 (2018-10-31) [Commit: 56acdd2]

{'1.0.0': {'previous': None,
  'next': ['1.0.1', '1.1.0'],
  'date': '2015-02-03',
  'commit': <git.Commit "697aecadc3ba62bc11f3ba0a6c8522daeec7b53f">,
  'branch_origin': None},
 '1.0.1': {'previous': '1.0.0',
  'next': None,
  'date': '2015-05-14',
  'commit': <git.Commit "73b600dc79ba8a9a32078a2ea0eb8ae3df20c9d5">,
  'branch_origin': None},
 '1.1.0': {'previous': '1.0.0',
  'next': ['1.1.1', '1.2.0'],
  'date': '2015-03-09',
  'commit': <git.Commit "e60744d017ef79f1b17f474c0b969d4ca5592462">,
  'branch_origin': None},
 '1.1.1': {'previous': '1.1.0',
  'next': None,
  'date': '2015-05-14',
  'commit': <git.Commit "3e8d832a1a8e2b12029adcb55862cf040098ef0f">,
  'branch_origin': None},
 '1.2.0': {'previous': '1.1.0',
  'next': ['1.2.1'],
  'date': '2015-05-14',
  'commit': <git.Commit "7f237de447bcd726bb3d0ba332cbb733f39fc02f">,
  'branch_origin': None},
 '1.2.1': {'previous': '1.2.0',
  'next': ['1.2.2', '2.0.0'],
  'date': '2015-06-19',
  'commit': <git.Commit "243e7c1ac39cb7ac8b65c5bc
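
The indented listing printed above follows the `next` links of this dictionary. The body of display_hierarchy is not shown, but a depth-first walk along these lines would produce the same layout. `hierarchy_lines` is a hypothetical helper, and the toy tree stores commits as plain strings and cuts the tree off after `1.1.0`.

```python
def hierarchy_lines(tree: dict, root: str, depth: int = 0) -> list[str]:
    """Return one indented line per version, walking the 'next' links depth-first."""
    node = tree[root]
    indent = "    " * depth
    lines = [f"{indent}{root} ({node['date']}) [Commit: {node['commit'][:7]}]"]
    for child in node["next"] or []:  # 'next' is None for leaf versions
        lines.extend(hierarchy_lines(tree, child, depth + 1))
    return lines


# Toy subset of the structure shown above, truncated after 1.1.0.
toy_tree = {
    "1.0.0": {"previous": None, "next": ["1.0.1", "1.1.0"], "date": "2015-02-03",
              "commit": "697aecadc3ba62bc11f3ba0a6c8522daeec7b53f"},
    "1.0.1": {"previous": "1.0.0", "next": None, "date": "2015-05-14",
              "commit": "73b600dc79ba8a9a32078a2ea0eb8ae3df20c9d5"},
    "1.1.0": {"previous": "1.0.0", "next": None, "date": "2015-03-09",
              "commit": "e60744d017ef79f1b17f474c0b969d4ca5592462"},
}
toy_lines = hierarchy_lines(toy_tree, "1.0.0")
print("\n".join(toy_lines))
```

Returning the lines instead of printing inside the recursion makes the function easy to test and reuse.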

In [32]:
dynamic_metrics = collect_dynamic_metrics_v2(version_json)

Starting parallel processing with 8 threads...
Processing commits between 1.2.1 (243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558) and 2.0.0 (7f9f1fcb8697fb33f0edc2c391930a3728d247d7)
Processing commits between 2.0.1 (e3cfeebcefe9a19c5055afdcbb00646908340694) and 2.1.0 (9265bc24d75ac945bde9ce1a0999fddd8f2aae29)
Processing commits between 2.3.1 (7590572d9265e15286628013268b2ce785c6aa08) and 2.3.2 (857a9fd8ad725a53bd95c1b2d6612f9b1155f44d)
Processing commits between 2.1.0 (9265bc24d75ac945bde9ce1a0999fddd8f2aae29) and 2.1.1 (1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c)
Processing commits between 2.0.0 (7f9f1fcb8697fb33f0edc2c391930a3728d247d7) and 2.0.1 (e3cfeebcefe9a19c5055afdcbb00646908340694)
Processing commits between 2.1.1 (1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c) and 2.2.0 (da840b0f8fa99cab9f004810cd22abc207493cae)
Processing commits between 2.1.1 (1af77bbf8356e86cabbed92cfa8cc2e1470a1d5c) and 2.3.0 (6f4c35c9e904d226451c465effdc5bfd31d395a0)
Processing commits between 2.3.0 (6f4c35c9e904d226
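
The log above shows one task per (previous version, version) pair, handled by 8 threads, which is why the lines appear out of release order. A minimal sketch of that fan-out with the standard library follows. `version_pairs` and `process_pair` are hypothetical; a real worker would walk the commits between the two tags.

```python
from concurrent.futures import ThreadPoolExecutor


def version_pairs(tree: dict) -> list[tuple[str, str]]:
    """List (previous, version) pairs for every version that has a predecessor."""
    return [(node["previous"], name) for name, node in tree.items() if node["previous"]]


def process_pair(pair: tuple[str, str]) -> tuple[str, str]:
    """Hypothetical worker; returns a description instead of real metrics."""
    previous, current = pair
    return current, f"commits between {previous} and {current}"


# Toy tree keeping only the 'previous' links needed for this example.
toy_versions = {
    "1.0.0": {"previous": None},
    "2.1.1": {"previous": "2.1.0"},
    "2.2.0": {"previous": "2.1.1"},
    "2.3.0": {"previous": "2.1.1"},
}

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in input order even if workers finish out of order.
    toy_results = dict(pool.map(process_pair, version_pairs(toy_versions)))
```

Threads suit this step when the work is dominated by Git subprocess or disk I/O; CPU-bound work in pure Python would not speed up this way.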

In [33]:
convert_json_to_csv()

Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.1_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.3_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.2_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.4_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.5_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.0.1_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.6_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/3.1.1_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.7_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/3.1.2_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/2.3.8_dynamic_metrics.csv
Dynamic metrics CSV created: Output/dynamic_metrics_output/3.1.3_

In [34]:
merge_static_and_dynamic_csv()

Static metrics file missing for version 3.1.3. Skipping...
Static metrics file missing for version 3.1.2. Skipping...
Static metrics file missing for version 3.1.1. Skipping...
Static metrics file missing for version 2.2.0. Skipping...
Static metrics file missing for version 3.1.0. Skipping...
Static metrics file missing for version 2.0.0. Skipping...
Static metrics file missing for version 2.0.1. Skipping...
Static metrics file missing for version 2.3.1. Skipping...
Static metrics file missing for version 2.3.0. Skipping...
Static metrics file missing for version 3.0.0. Skipping...
Static metrics file missing for version 2.3.3. Skipping...
Static metrics file missing for version 2.3.2. Skipping...
Static metrics file missing for version 2.3.4. Skipping...
Static metrics file missing for version 2.1.0. Skipping...
Static metrics file missing for version 2.3.5. Skipping...
Static metrics file missing for version 2.1.1. Skipping...
Static metrics file missing for version 2.3.6. Skipping.
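
The messages above indicate that a version is skipped whenever its static metrics file is absent. The sketch below shows such a skip-or-merge step with the standard library. The static file name, the shared `File` column, the `Commits` column, and the inner-join behaviour are assumptions for illustration; only the dynamic file naming comes from the output of the previous cell.

```python
import csv
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Optional


def read_rows(csv_path: Path) -> list[dict]:
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def merge_version(version: str, static_dir: Path, dynamic_dir: Path) -> Optional[list[dict]]:
    """Join static and dynamic rows on 'File', or skip if the static file is missing."""
    static_file = static_dir / f"{version}_static_metrics.csv"  # hypothetical name
    dynamic_file = dynamic_dir / f"{version}_dynamic_metrics.csv"
    if not static_file.exists():
        print(f"Static metrics file missing for version {version}. Skipping...")
        return None
    dynamic_by_file = {row["File"]: row for row in read_rows(dynamic_file)}
    # Inner join: keep only the files present in both inputs.
    return [
        {**row, **dynamic_by_file[row["File"]]}
        for row in read_rows(static_file)
        if row["File"] in dynamic_by_file
    ]


with TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2.3.9_static_metrics.csv").write_text(
        "File,CountLine\nA.java,120\n", encoding="utf-8")
    (root / "2.3.9_dynamic_metrics.csv").write_text(
        "File,Commits\nA.java,7\nB.java,2\n", encoding="utf-8")
    merged = merge_version("2.3.9", root, root)
    skipped = merge_version("3.1.3", root, root)
```

When most versions are skipped, as in the output above, it is worth checking that the static metrics were generated in the directory this step reads from.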