# Prow Logs and GCS Data

In [1]:
import gzip
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.cloud import storage

### What do we have access to as data scientists when digging into the build artifacts?

In this notebook we want to determine how to interact with the available data on GCS. We will provide some EDA and perform two tasks:


1. Compare the different log files present throughout the archives and quantify how complete and comparable our log dataset is.
1. Download a small sample dataset of events and build logs to perform some EDA on.  


In [2]:
# Load test file to access our record of job names
with gzip.open("../../../data/raw/testgrid_810.json.gz", "rb") as read_file:
    testgrid_data = json.load(read_file)

### Example to access a single prow artifacts page

Let's make sure we understand how this works, and focus on a single job first.

In [3]:
tab = '"redhat-openshift-ocp-release-4.6-informing"'
grid = "periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade"

In [4]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}"
)
soup = BeautifulSoup(response.text, "html.parser")
list_of_builds = [x.get_text()[1:-1] for x in soup.find_all("a")][1:-1]

In [5]:
response = requests.get(
    f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{list_of_builds[1]}"
)
response.url

'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-upgrade/1364869749170769920'

In [6]:
soup = BeautifulSoup(response.text, "html.parser")

In [7]:
[x.get_text() for x in soup.find_all("a")]

[' ..',
 ' artifacts/',
 ' build-log.txt',
 ' finished.json',
 ' podinfo.json',
 ' prowjob.json',
 ' started.json']

Great, we can now programmatically access the archives. Now, let's walk through all of the build archives for a single job and record what each one contains at the first level of its directory.

In [8]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{build}"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts

In [9]:
builds_info = pd.Series({k: len(v) for (k, v) in build_data.items()})

In [10]:
builds_info.value_counts()

7    210
6      3
4      1
5      1
dtype: int64

In [11]:
pd.Series(build_data).apply(" ".join).value_counts()

 ..  artifacts/  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json    210
 ..  artifacts/  build-log.txt  finished.json  prowjob.json  started.json                    2
 ..  build-log.txt  finished.json  podinfo.json  prowjob.json  started.json                  1
 ..  build-log.txt  finished.json  prowjob.json  started.json                                1
 ..  artifacts/  prowjob.json  started.json                                                  1
dtype: int64

In [12]:
builds_info.value_counts() / len(builds_info)

7    0.976744
6    0.013953
4    0.004651
5    0.004651
dtype: float64

About 97.7% of the records for this job appear to be complete and include the 'artifacts/' subdirectory. Let's dig in and see what those subdirectories contain.
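
As a rough sketch of how this completeness figure could be computed directly from the listings rather than read off the counts, the snippet below checks each build against a set of expected entries. The `EXPECTED` set, the `completeness` helper, and the sample `listings` dict are illustrative assumptions modeled on the output above, not data pulled from GCS.

```python
# Entries in the most common root listing above (assumed here to define "complete").
EXPECTED = {
    "artifacts/",
    "build-log.txt",
    "finished.json",
    "podinfo.json",
    "prowjob.json",
    "started.json",
}


def completeness(listings, expected=EXPECTED):
    """Return the fraction of builds whose listing contains every expected entry."""
    if not listings:
        return 0.0
    complete = 0
    for entries in listings.values():
        # The link text carries leading whitespace, so normalise before comparing.
        names = {entry.strip() for entry in entries}
        if expected <= names:
            complete += 1
    return complete / len(listings)


# Hypothetical sample shaped like `build_data` in the cells above.
listings = {
    "build-a": [" ..", " artifacts/", " build-log.txt", " finished.json",
                " podinfo.json", " prowjob.json", " started.json"],
    "build-b": [" ..", " artifacts/", " prowjob.json", " started.json"],
}
print(completeness(listings))
```

Passing the real `build_data` dict in place of `listings` should give the share of fully populated builds, provided the expected set matches what you consider complete.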

In [13]:
build_data = {}

for build in list_of_builds:
    response = requests.get(
        f"https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/{grid}/{build}/artifacts"
    )
    soup = BeautifulSoup(response.text, "html.parser")
    artifacts = [x.get_text() for x in soup.find_all("a")]
    build_data[build] = artifacts

In [14]:
artifact_info = pd.Series({k: len(v) for (k, v) in build_data.items()})
artifact_info.value_counts()

6    95
7    43
5    36
8    33
1     2
2     2
3     2
4     2
dtype: int64

In [15]:
artifact_info.value_counts() / len(artifact_info)

6    0.441860
7    0.200000
5    0.167442
8    0.153488
1    0.009302
2    0.009302
3    0.009302
4    0.009302
dtype: float64

In [16]:
pd.Series(build_data).apply(" ".join).value_counts()

 ..  build-resources/  e2e-gcp-upgrade/  release/  junit_operator.xml  metadata.json                                                  92
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator.log  junit_operator.xml  metadata.json                                 39
 ..  build-resources/  e2e-gcp-upgrade/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json    33
 ..  build-resources/  e2e-gcp-upgrade/  junit_operator.xml  metadata.json                                                            27
 ..  build-resources/  release/  junit_operator.xml  metadata.json                                                                     9
 ..  build-resources/  release/  ci-operator-step-graph.json  ci-operator.log  junit_operator.xml  metadata.json                       4
 ..  build-resources/  release/  ci-operator.log  junit_operator.xml  metadata.json                                                    2
 ..                                      

We can see that once we get down into the artifacts there is a fair bit less uniformity in the data available to us. And this is all within a single job! We will look into it next, but my assumption is that this issue gets worse when comparing available artifacts across jobs.

This will make it somewhat difficult to use these sets of documents as a whole to compare CI behaviour. At this point, it might make sense to consider looking only at the same document (log) across jobs when available.
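
One way to decide which documents are worth comparing is to measure how often each entry appears across builds. The sketch below is a minimal illustration of that idea; `file_coverage` and the `sample` listings are hypothetical names and data, shaped like the `build_data` dict used above.

```python
from collections import Counter


def file_coverage(listings):
    """Map each entry name to the fraction of builds whose listing contains it."""
    counts = Counter()
    for entries in listings.values():
        # Use a set so a name counts once per build, and skip the parent link.
        counts.update({entry.strip() for entry in entries} - {".."})
    total = len(listings)
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical listings, not real GCS data.
sample = {
    "build-a": [" ..", " build-resources/", " release/",
                " junit_operator.xml", " metadata.json"],
    "build-b": [" ..", " build-resources/", " junit_operator.xml", " metadata.json"],
    "build-c": [" .."],
}
coverage = file_coverage(sample)
for name, frac in sorted(coverage.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{name:20s} {frac:.2f}")
```

Entries with coverage close to 1.0 are the natural candidates for cross-build, and eventually cross-job, comparison.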

In the next sections we walk through how to access and parse the build log as well as the events JSON, either to download a small testing dataset or to work directly with the data in memory.

## Collecting build logs

In [17]:
def download_public_file(bucket_name, source_blob_name):
    """Downloads a public blob from the bucket."""
    storage_client = storage.Client.create_anonymous_client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    if blob.exists(storage_client):
        text = blob.download_as_text()
    else:
        text = ""
    return text

In [18]:
build_log_data = {}
for build in list_of_builds:
    file = download_public_file("origin-ci-test", f"logs/{grid}/{build}/build-log.txt")
    build_log_data[build] = file

In [19]:
def get_counts(x):
    """
    Gets counts for chars, words, lines for a log.
    """
    if x:
        chars = len(x)
        words = len(x.split())
        lines = x.count("\n") + 1
        return chars, words, lines
    else:
        return 0, 0, 0

In [20]:
## Create a dataframe with char, words, and lines
## count for the logs
data = []
for key, value in build_log_data.items():
    chars, words, lines = get_counts(value)
    data.append([key, chars, words, lines])

df = pd.DataFrame(data=data, columns=["build_log_id", "chars", "words", "lines"])
df

| | build_log_id | chars | words | lines |
|---|---|---|---|---|
| 0 | 1364778930069835776 | 517 | 51 | 5 |
| 1 | 1364869749170769920 | 10805 | 919 | 111 |
| 2 | 1364960659313266688 | 10807 | 919 | 111 |
| 3 | 1365051265100288000 | 15758 | 1316 | 184 |
| 4 | 1365142142841786368 | 10804 | 919 | 111 |
| ... | ... | ... | ... | ... |
| 210 | 1383782446109036544 | 37601 | 2690 | 261 |
| 211 | 1383873157873537024 | 16784 | 1167 | 120 |
| 212 | 1383963863615016960 | 4331 | 246 | 42 |
| 213 | 1384054560330354688 | 16787 | 1167 | 120 |


#### See the stats for chars, words, lines

In [21]:
df["chars"].describe()

count    2.150000e+02
mean     1.702544e+05
std      8.190881e+05
min      0.000000e+00
25%      1.080750e+04
50%      1.104000e+04
75%      1.633500e+04
max      6.662932e+06
Name: chars, dtype: float64

In [22]:
df["words"].describe()

count       215.000000
mean      11718.827907
std       57193.249040
min           0.000000
25%         906.500000
50%         926.000000
75%        1287.000000
max      508998.000000
Name: words, dtype: float64

In [23]:
df["lines"].describe()

count      215.00000
mean       703.47907
std       2920.66318
min          0.00000
25%        111.00000
50%        112.00000
75%        120.00000
max      21291.00000
Name: lines, dtype: float64

* From this initial analysis, we see that log length ranges from 0 lines (a missing or empty build log) to ~21,000 lines, with a mean of about 703 lines but a median of only 112. This suggests high variability, driven by a small number of very large logs. The next few things to look at would be similarity between the log files, word analysis, templating, and clustering.
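
As a first, deliberately simple take on log similarity, the sketch below masks digit runs (a crude stand-in for templating) and compares two logs by the Jaccard similarity of their token sets. The helper names and the two sample log lines are assumptions made up for illustration; this is not the method used elsewhere in the project.

```python
import re


def tokenize(log_text):
    """Lowercase the text, mask digit runs, and return the set of whitespace tokens."""
    masked = re.sub(r"\d+", "<num>", log_text.lower())
    return set(masked.split())


def jaccard(log_a, log_b):
    """Jaccard similarity of two logs' token sets (defined as 1.0 if both are empty)."""
    a, b = tokenize(log_a), tokenize(log_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up log lines that differ only in timestamps and a commit-like suffix.
log_1 = "2021/02/25 10:00:01 Resolved source https://example.invalid to master@abc123"
log_2 = "2021/02/26 11:30:45 Resolved source https://example.invalid to master@def456"
print(round(jaccard(log_1, log_2), 2))
```

Applied pairwise to the values of `build_log_data`, a score like this could feed a clustering step, although token sets ignore ordering and repetition, so it is only a starting point.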

## Collect the events data

In [24]:
build_events_data = {}
for build in list_of_builds:
    file = download_public_file(
        "origin-ci-test", f"logs/{grid}/{build}/artifacts/build-resources/events.json"
    )
    if file:
        build_events_data[build] = json.loads(file)
    else:
        build_events_data[build] = ""

In [25]:
## Percentage of builds that have the events.json file
count = 0
for key, value in build_events_data.items():
    if value:
        count += 1
count * 100 / len(build_events_data)

97.20930232558139

In [26]:
# Analyzing the messages of a single build
messages = [
    (i["metadata"]["uid"], i["message"])
    for i in build_events_data["1364869749170769920"]["items"]
]
messages_df = pd.DataFrame(messages, columns=["UID", "message"])
messages_df

| | UID | message |
|---|---|---|
| 0 | 504b881c-9e97-46a0-b206-765c9973e1d3 | Running job periodic-ci-openshift-release-mast... |
| 1 | 3a92467e-993e-43a1-8eee-24ba7a22508f | Running job periodic-ci-openshift-release-mast... |
| 2 | ed40875c-b182-4164-b730-a3754ed94124 | Running job periodic-ci-openshift-release-mast... |
| 3 | b8c5b026-936e-4f28-a0de-62051ff378d8 | Running job periodic-ci-openshift-release-mast... |
| 4 | b1b8265a-d87d-430d-b38e-c10c7f3fb91c | No matching pods found |
| ... | ... | ... |
| 477 | db97f7b5-5fb4-4e68-aa82-cd9c120b9c8d | Container image "gcr.io/k8s-prow/sidecar:v2021... |
| 478 | 947a5c3f-f318-42d8-88c5-133f9783e94d | Created container sidecar |
| 479 | e94e0988-c2ea-4045-bb84-251410018d94 | Started container sidecar |
| 480 | 00bef865-7965-4b65-b39a-9993347c5942 | Back-off pulling image "image-registry.openshi... |


In [27]:
messages_df["message"].describe()

count                           482
unique                          156
top       Started container sidecar
freq                             29
Name: message, dtype: object

In [28]:
messages_df["message"].value_counts().reset_index()

| | index | message |
|---|---|---|
| 0 | Started container sidecar | 29 |
| 1 | Created container sidecar | 29 |
| 2 | Created container place-entrypoint | 29 |
| 3 | Container image "gcr.io/k8s-prow/entrypoint:v2... | 29 |
| 4 | Container image "gcr.io/k8s-prow/sidecar:v2021... | 29 |
| ... | ... | ... |
| 151 | (combined from similar events): Failed to calc... | 1 |
| 152 | Add eth0 [10.131.37.32/23] | 1 |
| 153 | Successfully pulled image "registry.ci.openshi... | 1 |
| 154 | Add eth0 [10.129.9.17/23] | 1 |


In the build data, we saw that about 97.2% of builds have the events.json file. We further analyzed all the events recorded for one particular build and found the frequencies of their messages. We can repeat the process for all the other builds to find the most common messages and perform further analysis.
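
Repeating the message count across every build could look roughly like the sketch below. It assumes the same structure as `build_events_data` above (a dict of parsed events.json payloads, with an empty string where the file was missing); `message_counts` and the `sample_events` data are hypothetical.

```python
from collections import Counter


def message_counts(events_by_build):
    """Count event messages across all builds, skipping builds without events."""
    counts = Counter()
    for events in events_by_build.values():
        if not events:  # "" is stored when events.json is missing
            continue
        for item in events.get("items", []):
            message = item.get("message")
            if message:
                counts[message] += 1
    return counts


# Hypothetical data shaped like `build_events_data`.
sample_events = {
    "build-a": {"items": [
        {"metadata": {"uid": "u1"}, "message": "Started container sidecar"},
        {"metadata": {"uid": "u2"}, "message": "Created container sidecar"},
    ]},
    "build-b": {"items": [
        {"metadata": {"uid": "u3"}, "message": "Started container sidecar"},
    ]},
    "build-c": "",
}
counts = message_counts(sample_events)
print(counts.most_common(2))
```

Calling `message_counts(build_events_data)` on the real data would give job-wide frequencies that can be compared against the single-build table above.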

## Conclusion

In this notebook, we saw how to programmatically access the archives, the build logs, and the events JSON. Most builds had the same files in their root directory, but we saw higher variability in the contents of the artifacts directory. For the logs, we saw high variability in the number of lines, indicating that we need a more thorough EDA. The events.json files led to some neat messages that can be explored further on a larger dataset.