# Prow Logs and GCS Data

In [1]:
import gzip
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

### What do we have access to as data scientists when digging into the build artifacts?

In this notebook we want to determine how to interact with the available data on GCS. We will provide some EDA and perform two tasks:

1. Compare the different log files present throughout the archives and quantify how complete and comparable our log dataset is.
1. Download a small sample dataset of events and build logs to perform some EDA on.


In [2]:
# Load test file to access our record of job names
with gzip.open("../../../data/raw/testgrid_810.json.gz", "rb") as read_file:
    testgrid_data = json.load(read_file)

### Example: accessing a single Prow artifacts page

Let's make sure we understand how this works, and focus on a single job first.

In [3]:
tab = '"redhat-openshift-ocp-release-4.6-informing"'
grid = "periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade"

In [4]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}"
)
soup = BeautifulSoup(response.text, "html.parser")
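# Link texts look like " <build id>/": the inner [1:-1] trims the leading space and trailing slash.
# The outer [1:-1] drops the first link (" ..") and the last link on the page.
list_of_builds = [x.get_text()[1:-1] for x in soup.find_all("a")][1:-1]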

In [5]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{list_of_builds[1]}"
)
response.url

'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade/1364869749170769920'

In [6]:
soup = BeautifulSoup(response.text, "html.parser")

In [7]:
[x.get_text() for x in soup.find_all("a")]

[' ..',
 ' artifacts/',
 ' build-log.txt',
 ' finished.json',
 ' podinfo.json',
 ' prowjob.json',
 ' started.json']

Great, we can now programmatically access the archives. Now, let's walk through all of the build archives for a single job and list what each one contains at the first level of its directory.

In [8]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{build}"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts
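
The fetch-and-parse pattern in the loop above recurs throughout this notebook, so it may be convenient to wrap it in a helper. The following is a minimal sketch, not part of the original analysis: `BASE_URL`, `parse_listing` and `list_entries` are hypothetical names introduced here, and the timeout value is an arbitrary assumption. Unlike the cells above, it strips whitespace from each name and drops the `..` parent link, so counts based on it would be one lower than those reported in this notebook.

```python
import requests
from bs4 import BeautifulSoup

# Same gcsweb prefix as used in the cells above.
BASE_URL = "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs"


def parse_listing(html):
    """Return the entry names on a gcsweb directory page, without the '..' parent link."""
    soup = BeautifulSoup(html, "html.parser")
    names = [a.get_text().strip() for a in soup.find_all("a")]
    return [name for name in names if name and name != ".."]


def list_entries(job, *path_parts, timeout=30):
    """Fetch a directory page below a job (hypothetical helper) and list its entries."""
    url = "/".join([BASE_URL, job, *path_parts])
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return parse_listing(response.text)
```

With this in place, the loop body could be reduced to something like `build_data[build] = list_entries(grid, build)`, and the artifacts listing to `list_entries(grid, build, "artifacts")`.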

In [9]:
builds_info = pd.Series({k: len(v) for (k, v) in build_data.items()})

In [10]:
builds_info.value_counts()

7    135
6      3
4      1
5      1
dtype: int64

In [11]:
pd.Series(build_data).apply(" ".join).value_counts()

 ..  artifacts/  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json    135
 ..  artifacts/  build-log.txt  finished.json  prowjob.json  started.json                    2
 ..  build-log.txt  finished.json  prowjob.json  started.json                                1
 ..  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json                  1
 ..  artifacts/  prowjob.json  started.json                                                  1
dtype: int64

In [12]:
builds_info.value_counts() / len(builds_info)

7    0.964286
6    0.021429
4    0.007143
5    0.007143
dtype: float64

About 96% of the builds for this job appear to be complete, listing all seven entries (the ' ..' parent link plus six items), and all but two of the 140 builds include the 'artifacts/' subdirectory. Let's dig in and see what it contains.

In [13]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{build}/artifacts"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts

In [14]:
artifact_info = pd.Series({k: len(v) for (k, v) in build_data.items()})
artifact_info.value_counts()

6    92
5    36
7     4
1     2
2     2
3     2
4     2
dtype: int64

In [15]:
artifact_info.value_counts() / len(artifact_info)

6    0.657143
5    0.257143
7    0.028571
1    0.014286
2    0.014286
3    0.014286
4    0.014286
dtype: float64

In [16]:
pd.Series(build_data).apply(" ".join).value_counts()

 ..  build-resources/  e2e-gcp-upgrade/  release/  junit_operator.xml  metadata.json                     92
 ..  build-resources/  e2e-gcp-upgrade/  junit_operator.xml  metadata.json                               27
 ..  build-resources/  release/  junit_operator.xml  metadata.json                                        9
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator.log  junit_operator.xml  metadata.json     4
 ..  build-resources/  junit_operator.xml  metadata.json                                                  2
 ..                                                                                                       2
 ..  junit_job.xml                                                                                        1
 ..  junit_job.xml  metadata.json                                                                         1
 ..  e2e-gcp-upgrade/  release/                                                                           1
 ..  release/               

We can see that once we get down into the artifacts, the data available to us is a fair bit less uniform. And this is all within a single job! We will look into it next, but my assumption is that this issue gets worse when comparing available artifacts across jobs.

This will make it somewhat difficult to use these sets of documents as a whole to compare different CI behaviour. At this point, it might make sense to consider looking only at the same document (log) across jobs when available.

In the next section we are going to walk through how to access and parse the build log as well as the events JSON, either to download a small testing dataset or to work directly with the data in memory.
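
One way to decide which single document is worth comparing is to measure how often each entry appears across builds. The sketch below is an illustration under stated assumptions: `sample_build_data` is a made-up miniature of the `build_data` dictionary (the build IDs are placeholders, not real builds), and `presence_table` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical miniature of `build_data`: build ID -> entry names as scraped
# (leading spaces included, as in the outputs above).
sample_build_data = {
    "build-a": [" ..", " build-resources/", " junit_operator.xml", " metadata.json"],
    "build-b": [" ..", " junit_job.xml", " metadata.json"],
    "build-c": [" .."],
}


def presence_table(build_data):
    """One row per build, one boolean column per entry name ('..' excluded)."""
    cleaned = {
        build: {name.strip() for name in names} - {".."}
        for build, names in build_data.items()
    }
    columns = sorted(set().union(*cleaned.values()))
    rows = [[name in entries for name in columns] for entries in cleaned.values()]
    return pd.DataFrame(rows, index=list(cleaned), columns=columns)


table = presence_table(sample_build_data)
# The mean of a boolean column is the share of builds that contain that entry.
coverage = table.mean().sort_values(ascending=False)
print(coverage)
```

Entries with coverage close to 1.0 are the natural candidates for comparison across builds; running the same table over several jobs would show whether that still holds across jobs.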

## Quick example of reading build logs in memory

In [17]:
# Adjacent string literals are concatenated, so the URL contains no stray whitespace.
response = requests.get(
    "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/"
    f"gcs/origin-ci-test/logs/{grid}/1364778930069835776/"
    "build-log.txt"
)

In [18]:
response.url

'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade/1364778930069835776/build-log.txt'
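
Since the URL pattern resolves as expected, the same pattern could be used to save a small sample of build logs to disk rather than keeping them in memory. This is a rough sketch only: `build_log_url`, `fetch_text` and `save_build_logs` are hypothetical helpers, the one-file-per-build layout is an assumption, and the `fetch` parameter exists so the function can be exercised without network access.

```python
from pathlib import Path

# Same gcsweb prefix as used in the cells above.
BASE_URL = "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs"


def build_log_url(job, build_id):
    """URL of build-log.txt for one build, following the pattern shown above."""
    return f"{BASE_URL}/{job}/{build_id}/build-log.txt"


def fetch_text(url):
    """Default fetcher: a plain GET that raises on HTTP errors."""
    import requests  # imported lazily so a stub fetcher can be used without requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def save_build_logs(job, build_ids, dest_dir, fetch=fetch_text):
    """Write one <build_id>.txt file per build under dest_dir and return the paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for build_id in build_ids:
        path = dest / f"{build_id}.txt"
        path.write_text(fetch(build_log_url(job, build_id)), encoding="utf-8")
        written.append(path)
    return written
```

A call such as `save_build_logs(grid, list_of_builds[:10], "sample_logs")` would then download the first ten logs. Builds that lack a `build-log.txt` (one such build appeared in the listing above) would raise an HTTP error with the default fetcher, so a real run would probably need to catch and skip those.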

In [19]:
response.text

'2021/02/25 03:27:14 ci-operator version v20210224-231f07b\n2021/02/25 03:27:14 Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.6-upgrade-from-stable-4.5]\nerror: failed to load configuration: got unexpected http 404 status code from configresolver: failed to get config: could not find any config for branch master on repo openshift/release\ntime="2021-02-25T03:27:14Z" level=info msg="Reporting job state \'failed\' with reason \'loading_args:loading_config:config_resolver\'"\n'

In [20]:
# The build log is plain text, so it can be printed directly without an HTML parser.
print(response.text)

2021/02/25 03:27:14 ci-operator version v20210224-231f07b
2021/02/25 03:27:14 Loading configuration from https://config.ci.openshift.org for openshift/release@master [ci-4.6-upgrade-from-stable-4.5]
error: failed to load configuration: got unexpected http 404 status code from configresolver: failed to get config: could not find any config for branch master on repo openshift/release
time="2021-02-25T03:27:14Z" level=info msg="Reporting job state 'failed' with reason 'loading_args:loading_config:config_resolver'"
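
Once the log text is in memory, it can be split into structured records for further EDA. The sketch below assumes that timestamped lines start with a `YYYY/MM/DD HH:MM:SS` prefix, as the first two lines of the log above do; other build logs may use different formats, so the pattern is an assumption rather than a general rule. `parse_log_lines` is a hypothetical helper, and `sample_log` copies the first three lines of the output above.

```python
import re

# First three lines of the build log printed above.
sample_log = (
    "2021/02/25 03:27:14 ci-operator version v20210224-231f07b\n"
    "2021/02/25 03:27:14 Loading configuration from https://config.ci.openshift.org "
    "for openshift/release@master [ci-4.6-upgrade-from-stable-4.5]\n"
    "error: failed to load configuration: got unexpected http 404 status code from "
    "configresolver: failed to get config: could not find any config for branch "
    "master on repo openshift/release\n"
)

# Assumed timestamp prefix: "YYYY/MM/DD HH:MM:SS " at the start of a line.
TIMESTAMP_PREFIX = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_log_lines(text):
    """Split a build log into records with an optional leading timestamp."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TIMESTAMP_PREFIX.match(line)
        if match:
            records.append({"timestamp": match.group(1), "message": match.group(2)})
        else:
            records.append({"timestamp": None, "message": line})
    return records


records = parse_log_lines(sample_log)
for record in records:
    print(record["timestamp"], "|", record["message"][:60])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(records)` if a tabular view is preferred.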

