# Data Exploration

This notebook contains an exploratory analysis of our dataset.
It walks through the individual sections of the data and describes which parts
were selected for the final dataset and why.

## Analysis of raw data

Every inspection unpacked from the archived dataset has the following structure:

- `inspection id`
    - **build**
        - *Dockerfile*
        - *log*
        - *specification*
    - **results**
        - **0**
            - *hwinfo*
            - *log*
            - *result*
        - **1**
            - *hwinfo*
            - *log*
            - *result*

where the number of results depends on the `batch_size` selected when running the Amun API.
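
The layout above can be walked with a few lines of standard-library code. The following is a minimal sketch, not the loader used by the Thoth libraries: the function name `list_result_dirs` and the inspection id `inspection-example` are hypothetical, and the directory tree is created in a temporary folder purely for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def list_result_dirs(inspection_dir: Path) -> List[Path]:
    """Return the numbered result directories of one inspection, in numeric order."""
    results_root = inspection_dir / "results"
    numbered = [p for p in results_root.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(numbered, key=lambda p: int(p.name))


# Build a throwaway inspection with two results to mirror the structure above.
with tempfile.TemporaryDirectory() as tmp:
    inspection = Path(tmp) / "inspection-example"  # hypothetical inspection id
    (inspection / "build").mkdir(parents=True)
    for name in ("Dockerfile", "log", "specification"):
        (inspection / "build" / name).touch()
    for index in range(2):  # two results, as in the example tree
        result_dir = inspection / "results" / str(index)
        result_dir.mkdir(parents=True)
        for name in ("hwinfo", "log", "result"):
            (result_dir / name).touch()

    result_names = [p.name for p in list_result_dirs(inspection)]
    print(result_names)
```

Sorting by the integer value of the directory name keeps `10` after `9` if a larger `batch_size` produces more than ten results.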

In [1]:
import os
import sys

# Set up the notebook environment to include the thoth_issue_predictor module.
module_path = os.path.abspath(os.path.join("../.."))
if module_path not in sys.path:
    sys.path.append(module_path)

module_path

'/home/tjanicek/thesis/thoth-issue-predictor'

In [2]:
from thoth_issue_predictor.preprocessing.preprocessing import Preprocessing
from thoth.report_processing.components.inspection import (
    AmunInspectionsSummary,
)

inspection_runs_summary = AmunInspectionsSummary()

## Retrieve the data

A parsed inspection is divided into five parts [copied from the Thoth documentation]:

* **Base Image** (e.g. rhel8, ubi8, thoth-ubi8-python36)
* **RPMs/Debian packages List**
* **Pinned Down Software Stack** (Pipfile/Pipfile.lock)
* **Hardware Requirement** (e.g. CPU only, GPU)
* **Performance Indicator (PI) and parameters**


Each result contains the following information [copied from the Thoth documentation]:

* **start_datetime**, when the inspection started;
* **end_datetime**, when the inspection ended;
* **document_id**, Document ID;
* **identifier**, Inspection identifier;
* **hwinfo**, hardware information where the inspection has been run;
    * **cpu_features**, flags, frequency, and L1, L2, L3 cache sizes [KB];
    * **cpu_info**, CPU info (e.g. brand, vendor_id, family, model);
    * **cpu_type**, flags identifying CPU Type (e.g. 'is_XEON': True);
    * **platform**;
        * **architecture**;
        * **machine**;
        * **node**;
        * **platform**;
        * **release**;
        * **version**;
        * **processor**;
* **os_release**, OS info taken from `"/etc/os-release"`;
* **runtime_environment**, runtime environment info;
    * **cuda_version**, CUDA version;
    * **hardware**, HW info, cpu family and model;
    * **operating_system**, OS name and version;
    * **python_version**;
* **script_sha256**, unique ID of the Performance Indicator used;
* **stdout**;
    * **@parameters**, parameters specific to the PI;
    * **@results**, results after running the PI (rate [GFLOPS] and elapsed time [ms]);
    * **component**, the component or library the PI is for (e.g. tensorflow, pytorch);
    * **name**, name of the PI (e.g. PiConv2D);
    * **{component}_buildinfo**, build info for the specific component (e.g. AICoE Tensorflow);
* **requirements**, e.g. Pipfile;
* **requirements_locked**, e.g. Pipfile.lock;
* **stderr**;
* **exit_code**;
* **usage**
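
The nested fields listed above appear later in this notebook as flat column names such as `hwinfo__platform__machine`, which suggests that nested keys are joined with a double underscore. The sketch below shows one way such flattening could work. It is an assumption about the naming scheme, not the actual preprocessing code, and the `flatten` helper and the sample document are made up for illustration.

```python
from typing import Any, Dict


def flatten(document: Dict[str, Any], prefix: str = "", sep: str = "__") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in document.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Made-up fragment of one result document; the values are illustrative only.
result = {
    "exit_code": 0,
    "hwinfo": {"platform": {"machine": "x86_64", "processor": "x86_64"}},
    "runtime_environment": {"python_version": "3.6"},
}
flat_result = flatten(result)
print(sorted(flat_result))
```

Under this scheme a key that itself starts with an underscore, such as `_meta`, yields three consecutive underscores in the column name.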

In [3]:
preprocessing = Preprocessing()
inspections_df = preprocessing.prepare_df()
f"Length of inspection DF is {len(inspections_df)}"

'Length of inspection DF is 248'

In [4]:
inspections_df.head()

Unnamed: 0,end_datetime,exit_code,hostname,script_sha256,start_datetime,stderr,inspection_document_id,identifier,specification_base,batch_size,...,flag__xsaveopt,flag__xtopology,flag__arch_capabilities,flag__cpuid_fault,flag__umip,flag__xsaves,inspection_start,inspection_end,inspection_duration,inspection_batch
0,,0,inspection-tf-dm-six-d9316be5-3840106319,8e0ad2b4cac88850e268b9d7b72b8569e564ac2ac106cf...,2020-09-09T05:11:07.368371,2020-09-09 05:11:08.843789: W tensorflow/strea...,inspection-tf-dm-six-d9316be5,tf-dm-six,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-09 05:11:07.368371,2020-09-09 05:11:09.830873,2,1
1,,0,inspection-tf-dm-six-0d800caa-447756917,8e0ad2b4cac88850e268b9d7b72b8569e564ac2ac106cf...,2020-09-09T06:36:15.861119,2020-09-09 06:36:17.773920: W tensorflow/strea...,inspection-tf-dm-six-0d800caa,tf-dm-six,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-09 06:36:15.861119,2020-09-09 06:36:19.130500,3,1
2,,0,inspection-tf-dm-tf24-0a2ec44b-1869278707,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2021-02-10T17:08:19.224421,2021-02-10 17:08:19.906690: W tensorflow/strea...,inspection-tf-dm-tf24-0a2ec44b,tf-dm-tf24,quay.io/thoth-station/s2i-thoth-ubi8-py38:v0.24.2,1,...,True,True,1.0,1.0,1.0,1.0,2021-02-10 17:08:19.224421,2021-02-10 17:08:41.711514,22,1
3,,0,inspection-tf-dm-rw-c294ae3d-828883074,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2020-09-04T12:59:38.387052,2020-09-04 12:59:40.213973: W tensorflow/strea...,inspection-tf-dm-rw-c294ae3d,tf-dm-rw,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-04 12:59:38.387052,2020-09-04 13:00:16.658982,38,1
4,,0,inspection-tf-dm-tf24-0a6a0d5e-844550074,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2021-02-19T05:28:23.392495,2021-02-19 05:28:24.188988: W tensorflow/strea...,inspection-tf-dm-tf24-0a6a0d5e,tf-dm-tf24,quay.io/thoth-station/s2i-thoth-ubi8-py38:v0.24.2,1,...,True,True,1.0,1.0,1.0,1.0,2021-02-19 05:28:23.392495,2021-02-19 05:29:09.782311,46,1


In [5]:
inspections_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248 entries, 0 to 247
Columns: 389 entries, end_datetime to inspection_batch
dtypes: datetime64[ns](2), float64(67), object(320)
memory usage: 755.6+ KB


In [6]:
# Display the number of failed software stacks and some examples.
failed_inspections = inspections_df[inspections_df["exit_code"] == 1]
print(f"Number of failed inspections: {len(failed_inspections)}")
failed_inspections.head()

Number of failed inspections: 38


Unnamed: 0,end_datetime,exit_code,hostname,script_sha256,start_datetime,stderr,inspection_document_id,identifier,specification_base,batch_size,...,flag__xsaveopt,flag__xtopology,flag__arch_capabilities,flag__cpuid_fault,flag__umip,flag__xsaves,inspection_start,inspection_end,inspection_duration,inspection_batch
210,,1,inspection-tf-dm-rw-a4120159-1845172180,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2020-09-05T07:20:03.449550,2020-09-05 07:20:06.212092: W tensorflow/strea...,inspection-tf-dm-rw-a4120159,tf-dm-rw,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-05 07:20:03.449550,2020-09-05 07:20:06.706654,3,1
211,,1,inspection-tf-dm-rw-aa4edf73-593141074,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2020-09-05T08:28:43.762961,2020-09-05 08:28:46.450120: W tensorflow/strea...,inspection-tf-dm-rw-aa4edf73,tf-dm-rw,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-05 08:28:43.762961,2020-09-05 08:28:47.101035,3,1
212,,1,inspection-tf-dm-rw-a5a58f40-1640460587,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2020-09-05T08:15:15.351333,2020-09-05 08:15:20.506670: W tensorflow/strea...,inspection-tf-dm-rw-a5a58f40,tf-dm-rw,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-05 08:15:15.351333,2020-09-05 08:15:21.497862,6,1
213,,1,inspection-tf-dm-six-80f213d6-1091086974,8e0ad2b4cac88850e268b9d7b72b8569e564ac2ac106cf...,2020-09-09T06:08:12.758128,2020-09-09 06:08:14.604801: W tensorflow/strea...,inspection-tf-dm-six-80f213d6,tf-dm-six,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-09 06:08:12.758128,2020-09-09 06:08:14.993665,2,1
214,,1,inspection-tf-dm-rw-eecfffae-2356673894,1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664...,2020-09-05T07:04:36.328976,2020-09-05 07:04:38.203494: W tensorflow/strea...,inspection-tf-dm-rw-eecfffae,tf-dm-rw,quay.io/thoth-station/s2i-thoth-ubi8-py36,1,...,True,True,,,,,2020-09-05 07:04:36.328976,2020-09-05 07:04:38.666282,2,1
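
The two counts reported so far (248 inspections in total, 38 of them with `exit_code == 1`) give a rough picture of how unevenly the outcomes are distributed. The arithmetic is sketched below using only those two numbers. The variable names are illustrative.

```python
total_inspections = 248  # length of inspections_df reported above
failed_count = 38        # rows with exit_code == 1 reported above

failed_share = failed_count / total_inspections
remaining_share = 1 - failed_share

print(f"failed: {failed_share:.1%}, remaining: {remaining_share:.1%}")
```

Roughly one inspection in seven failed, so any model trained on these outcomes sees far fewer failing examples than passing ones.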


## Create reports from inspections

Creating a report transforms the data into a more readable form and
separates it into five sections:
 - Software stacks
 - Hardware
 - Exit codes
 - Base image
 - Performance Indicators


In [7]:
report_results, _ = inspection_runs_summary.produce_summary_report(
    inspections_df=inspections_df
)

In [8]:
report_results.keys()

dict_keys(['hardware', 'base_image', 'software_stack', 'pi', 'exit_codes'])

### Software Stack

The software stack section contains information about the installed
package and its dependencies. The Python version, the dependencies, and
their package indexes serve as the basis for our dataset. Changing the Python
version results in a totally different software stack. The same is
true for changing the package index: even if two packages from different
indexes have the same name and version, there is no guarantee that they
are identical.
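
One way to express this in code is to make the index part of a package's identity. The sketch below is an illustration under that assumption, not the representation used in the dataset: the `PinnedPackage` class and the index name `example-index` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedPackage:
    """Hypothetical identity of one locked dependency: name, version, and index."""

    name: str
    version: str
    index: str


from_pypi = PinnedPackage("tensorflow", "2.1.0", "pypi-org")
from_other_index = PinnedPackage("tensorflow", "2.1.0", "example-index")  # made-up index name

# Name and version match, yet the two objects compare as different packages.
same_name_and_version = (from_pypi.name, from_pypi.version) == (
    from_other_index.name,
    from_other_index.version,
)
treated_as_same = from_pypi == from_other_index
print(same_name_and_version, treated_as_same)
```

Because the dataclass is frozen, instances are hashable and can be collected in a set to describe a whole software stack.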

In [9]:
report_results["software_stack"].keys()

dict_keys(['requirements_locked'])

In [10]:
report_results["software_stack"]["requirements_locked"].keys()

Index(['requirements_locked___meta__sources',
       'requirements_locked___meta__requires__python_version',
       'requirements_locked___meta__hash__sha256',
       'requirements_locked___meta__pipfile-spec',
       'requirements_locked__default__tensorflow__version',
       'requirements_locked__default__tensorflow__index',
       'requirements_locked__default__absl-py__version',
       'requirements_locked__default__absl-py__index',
       'requirements_locked__default__astor__version',
       'requirements_locked__default__astor__index',
       'requirements_locked__default__gast__version',
       'requirements_locked__default__gast__index',
       'requirements_locked__default__google-pasta__version',
       'requirements_locked__default__google-pasta__index',
       'requirements_locked__default__grpcio__version',
       'requirements_locked__default__grpcio__index',
       'requirements_locked__default__keras-applications__version',
       'requirements_locked__default__keras-a

In [11]:
report_results["software_stack"]["requirements_locked"].head(11)

Unnamed: 0,requirements_locked___meta__sources,requirements_locked___meta__requires__python_version,requirements_locked___meta__hash__sha256,requirements_locked___meta__pipfile-spec,requirements_locked__default__tensorflow__version,requirements_locked__default__tensorflow__index,requirements_locked__default__absl-py__version,requirements_locked__default__absl-py__index,requirements_locked__default__astor__version,requirements_locked__default__astor__index,...,requirements_locked__default__flatbuffers__version,requirements_locked__default__flatbuffers__index,requirements_locked__default__astunparse__version,requirements_locked__default__astunparse__index,requirements_locked__default__tensorboard-plugin-wit__version,requirements_locked__default__tensorboard-plugin-wit__index,requirements_locked__default__packaging__version,requirements_locked__default__packaging__index,requirements_locked__default__pyparsing__version,requirements_locked__default__pyparsing__index
0,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.8.0,pypi-org,==0.8.0,pypi-org,...,,,,,,,,,,
1,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.8.0,pypi-org,==0.8.0,pypi-org,...,,,,,,,,,,
2,,3.8,11ae15263f6d0a834ab2b568343195bdd01337d61b784d...,6,==2.4.0,pypi-org-simple,==0.11.0,pypi-org-simple,,,...,==1.12,pypi-org-simple,==1.6.3,pypi-org-simple,==1.6.0.post2,pypi-org-simple,,,,
3,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.9.0,pypi-org,==0.6.2,pypi-org,...,,,,,,,,,,
4,,3.8,36ef97fc4b26a5658b7cce806573df8356fdd6557a258f...,6,==2.4.1,pypi-org-simple,==0.10.0,pypi-org-simple,,,...,==1.12,pypi-org-simple,==1.6.3,pypi-org-simple,==1.6.0.post3,pypi-org-simple,==20.7,pypi-org-simple,==2.1.8,pypi-org-simple
5,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.8.0,pypi-org,==0.8.0,pypi-org,...,,,,,,,,,,
6,,3.8,11ae15263f6d0a834ab2b568343195bdd01337d61b784d...,6,==2.4.0,pypi-org-simple,==0.11.0,pypi-org-simple,,,...,==1.12,pypi-org-simple,==1.6.3,pypi-org-simple,==1.6.0.post3,pypi-org-simple,,,,
7,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.7.0,pypi-org,==0.6.2,pypi-org,...,,,,,,,,,,
8,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.8.0,pypi-org,==0.8.0,pypi-org,...,,,,,,,,,,
9,,3.6,c9e7f586cc7f527664b35f4581b3d5421512489efb2c89...,6,==2.1.0,pypi-org,==0.8.0,pypi-org,==0.8.0,pypi-org,...,,,,,,,,,,
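
The dependency columns shown above follow the pattern `requirements_locked__default__<package>__<field>`, where the field is `version` or `index`. The sketch below recovers the package name and field from such a column name. The helper `parse_dependency_column` is hypothetical and relies only on the column names visible in this notebook.

```python
from typing import Optional, Tuple

PREFIX = "requirements_locked__default__"  # prefix seen in the column listing above


def parse_dependency_column(column: str) -> Optional[Tuple[str, str]]:
    """Return (package, field) for a dependency column, or None for any other column."""
    if not column.startswith(PREFIX):
        return None
    package, separator, field = column[len(PREFIX):].rpartition("__")
    if not separator or field not in ("version", "index"):
        return None
    return package, field


columns = [
    "requirements_locked___meta__requires__python_version",
    "requirements_locked__default__tensorflow__version",
    "requirements_locked__default__absl-py__index",
    "requirements_locked__default__keras-applications__version",
]
parsed = {column: parse_dependency_column(column) for column in columns}
for column, value in parsed.items():
    print(column, "->", value)
```

Splitting from the right with `rpartition` keeps hyphenated package names such as `keras-applications` intact.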


### Exit codes

This section contains the results of the inspection jobs. Exit codes are
used as class labels for our classification models.

In [12]:
report_results["exit_codes"]["exit_code"]

|   | exit_code |
|---|-----------|
| 0 | 0 |
| 1 | 1 |


In [13]:
inspections_df["exit_code"].head()

0    0
1    0
2    0
3    0
4    0
Name: exit_code, dtype: object
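
Since the `exit_code` column is reported with dtype `object`, it may help to normalise the values before using them as labels. The sketch below maps exit codes to binary labels. Treating every non-zero code as a failure is an assumption made here for illustration, and `to_labels` is a hypothetical helper.

```python
from typing import Iterable, List


def to_labels(exit_codes: Iterable[object]) -> List[int]:
    """Map exit codes to binary labels: 0 stays 0, any other code becomes 1."""
    return [0 if int(code) == 0 else 1 for code in exit_codes]


# Mixed int and str values mimic an object-typed column; the values are illustrative.
labels = to_labels([0, "0", 1, "1", 0])
print(labels)
```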

### Base image

Information about the environment used to resolve, build, and execute software stacks.

In [14]:
report_results["base_image"]["base_image"].head()

|   | specification_base | os_release__name | os_release__version | os_release__version_id |
|---|---|---|---|---|
| 0 | quay.io/thoth-station/s2i-thoth-ubi8-py36 | Red Hat Enterprise Linux | 8.2 (Ootpa) | 8.2 |
| 1 | quay.io/thoth-station/s2i-thoth-ubi8-py38:v0.24.2 | Red Hat Enterprise Linux | 8.3 (Ootpa) | 8.3 |


### Hardware

The hardware section contains information about the platform and
build metadata, such as architecture, processor, and build time.
None of this data is used in our dataset.

In [15]:
report_results["hardware"]["platform"].head()

Unnamed: 0,hwinfo__platform__architecture,hwinfo__platform__machine,hwinfo__platform__node,hwinfo__platform__platform,hwinfo__platform__processor,hwinfo__platform__release,hwinfo__platform__version
0,,x86_64,inspection-tf-dm-six-d9316be5-3840106319,Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-...,x86_64,4.18.0-147.8.1.el8_1.x86_64,#1 SMP Wed Feb 26 03:08:15 UTC 2020
1,,x86_64,inspection-tf-dm-six-0d800caa-447756917,Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-...,x86_64,4.18.0-147.8.1.el8_1.x86_64,#1 SMP Wed Feb 26 03:08:15 UTC 2020
2,,x86_64,inspection-tf-dm-tf24-0a2ec44b-1869278707,Linux-4.18.0-193.41.1.el8_2.x86_64-x86_64-with...,x86_64,4.18.0-193.41.1.el8_2.x86_64,#1 SMP Wed Jan 13 11:33:33 EST 2021
3,,x86_64,inspection-tf-dm-rw-c294ae3d-828883074,Linux-4.18.0-147.8.1.el8_1.x86_64-x86_64-with-...,x86_64,4.18.0-147.8.1.el8_1.x86_64,#1 SMP Wed Feb 26 03:08:15 UTC 2020
4,,x86_64,inspection-tf-dm-tf24-0a6a0d5e-844550074,Linux-4.18.0-193.41.1.el8_2.x86_64-x86_64-with...,x86_64,4.18.0-193.41.1.el8_2.x86_64,#1 SMP Wed Jan 13 11:33:33 EST 2021
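
Excluding the hardware information from the dataset can be done by filtering column names on their prefix. The sketch below assumes that all platform columns start with `hwinfo__`, as in the output above. The helper `keep_feature_columns` is hypothetical and is not the selection logic used by the preprocessing module.

```python
from typing import List

EXCLUDED_PREFIXES = ("hwinfo__",)  # assumed prefix, taken from the column names above


def keep_feature_columns(columns: List[str]) -> List[str]:
    """Return the column names that do not start with an excluded prefix."""
    return [column for column in columns if not column.startswith(EXCLUDED_PREFIXES)]


columns = [
    "exit_code",
    "hwinfo__platform__machine",
    "hwinfo__platform__processor",
    "requirements_locked__default__tensorflow__version",
]
kept = keep_feature_columns(columns)
print(kept)
```

With a DataFrame, the resulting list can be used directly for selection, for example `inspections_df[keep_feature_columns(list(inspections_df.columns))]`.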


### Performance Indicators

This data is redundant for our dataset and was discarded entirely.

In [16]:
report_results["pi"]["pi"].head()

|   | script_sha256 | batch_size | stdout__component | stdout__name | stdout__@parameters__device | stdout__@parameters__dtype | stdout__@parameters__matrix_size | stdout__@parameters__mini_batch | stdout__@parameters__reps |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8e0ad2b4cac88850e268b9d7b72b8569e564ac2ac106cf... | 1 | tensorflow | PiImport |  |  |  |  |  |
| 1 | 1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664... | 1 | tensorflow | PiMatmul | cpu | float32 | 512.0 | 40.0 | 50.0 |
| 2 | 1ce2ae4cf0da06c8c55b99be8beb31cae7c6801bb99664... | 1 |  |  |  |  |  |  |  |
| 3 | 8e0ad2b4cac88850e268b9d7b72b8569e564ac2ac106cf... | 1 |  |  |  |  |  |  |  |
