# Exploratory code

This file collects some of the explorations and experiments that led to our final analysis.

## Parsing the schema file

In [1]:
from schema import data_schemas_from_file

schemas = data_schemas_from_file("../data/schema.csv")

from tabulate import tabulate

for key,value in schemas.items():
    print("#########################################################################")
    print(f"{key} ({value['file pattern']}):")
    print(tabulate(value['fields'],headers='keys'))

#########################################################################
job_events (job_events/part-?????-of-?????.csv.gz):
  field number  content           format       mandatory    formatter
--------------  ----------------  -----------  -----------  ----------------------------------------------------------------
             0  time              INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f8692402680>
             1  missing info      INTEGER      False        <function parse_schema_line.<locals>.<lambda> at 0x7f86924031c0>
             2  job ID            INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f8692403250>
             3  event type        INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f86924032e0>
             4  user              STRING_HASH  False        <function parse_schema_line.<locals>.<lambda> at 0x7f8692084e50>
             5  scheduling class  INTEGER      False  

## Starting Spark

In [2]:
from pyspark import SparkContext

# start Spark in local mode with as many worker threads as logical cores
sc = SparkContext("local[*]")
sc.setLogLevel("ERROR")

22/12/01 10:47:15 WARN Utils: Your hostname, Kixus-k resolves to a loopback address: 127.0.1.1; using 130.190.49.193 instead (on interface wlp4s0)
22/12/01 10:47:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/01 10:47:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
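
The master string passed to `SparkContext` decides how many worker threads a local run gets: `local` means one thread, `local[N]` means N threads, and `local[*]` means one thread per logical core. The helper below is a minimal sketch that decodes these simple forms and rejects anything else, so a mistyped string is caught early. The function name `local_threads` is hypothetical and not part of PySpark, and the sketch deliberately ignores the other master URLs Spark accepts.

```python
import re

# Matches only the simple local forms: "local", "local[N]" and "local[*]".
# Other master URLs Spark accepts (cluster URLs, local[N,F], ...) are out of scope.
_LOCAL_MASTER = re.compile(r"local(\[(\*|[0-9]+)\])?")


def local_threads(master):
    """Return the thread count requested by a simple local master string.

    Returns 1 for "local", an int for "local[N]" and the string
    "all cores" for "local[*]". Raises ValueError for anything else.
    """
    match = _LOCAL_MASTER.fullmatch(master)
    if match is None:
        raise ValueError(f"not a simple local master string: {master!r}")
    count = match.group(2)
    if count is None:
        return 1
    return "all cores" if count == "*" else int(count)


print(local_threads("local[*]"))
print(local_threads("local[4]"))
```

Calling `local_threads` on the master string before handing it to Spark is one way to make sure the thread count matches the intent stated in the code comment.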


Alternative approach with Spark SQL, which allows the use of DataFrames:

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Cluster analysis") \
    .getOrCreate()

## Opening and pre-processing machine events

This file describes events that occurred to the machines in the cluster, as well as their specifications (initial events).

First, let us see the schema of this file's data:

In [4]:
print(tabulate(schemas['machine_events']['fields'],headers='keys'))

  field number  content      format       mandatory    formatter
--------------  -----------  -----------  -----------  ----------------------------------------------------------------
             0  time         INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f86924039a0>
             1  machine ID   INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f8692403c70>
             2  event type   INTEGER      True         <function parse_schema_line.<locals>.<lambda> at 0x7f8692403be0>
             3  platform ID  STRING_HASH  False        <function parse_schema_line.<locals>.<lambda> at 0x7f8692403eb0>
             4  CPUs         FLOAT        False        <function parse_schema_line.<locals>.<lambda> at 0x7f8692403d90>
             5  Memory       FLOAT        False        <function parse_schema_line.<locals>.<lambda> at 0x7f8692438040>


In [5]:
machine_events = sc.textFile("data/machine_events/part-00000-of-00001.csv")
machine_events

data/machine_events/part-00000-of-00001.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
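
With the RDD approach, every record is still a raw text line. The sketch below shows one way such a line could be turned into a typed tuple using the per-field formatters from the parsed schema. It assumes that each field is a dict with `content`, `mandatory` and `formatter` keys, as the table above suggests, and that values never contain quoted commas. The name `parse_line`, the stand-in field list and the sample line are all made up for illustration.

```python
def parse_line(line, fields):
    """Split one CSV line and convert each value with its field's formatter.

    Empty values become None, unless the field is mandatory, in which
    case a ValueError is raised.
    """
    values = line.split(",")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    record = []
    for raw, field in zip(values, fields):
        if raw == "":
            if field["mandatory"]:
                raise ValueError(f"missing mandatory field {field['content']!r}")
            record.append(None)
        else:
            record.append(field["formatter"](raw))
    return tuple(record)


# Hypothetical stand-in for schemas['machine_events']['fields'].
example_fields = [
    {"content": "time", "mandatory": True, "formatter": int},
    {"content": "machine ID", "mandatory": True, "formatter": int},
    {"content": "event type", "mandatory": True, "formatter": int},
    {"content": "platform ID", "mandatory": False, "formatter": str},
    {"content": "CPUs", "mandatory": False, "formatter": float},
    {"content": "Memory", "mandatory": False, "formatter": float},
]

# Made-up line for illustration only, not taken from the data set.
example_line = "0,5,0,abc,0.5,"
example_record = parse_line(example_line, example_fields)
print(example_record)
```

On the RDD itself, the same function could be applied with something like `machine_events.map(lambda line: parse_line(line, fields))`, where `fields` is the list taken from the parsed schema.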

This simple approach works, but it would be nice to have the tabular methods provided by DataFrames. Fortunately, Spark offers a DataFrame API through Spark SQL. To [load a CSV as a DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv), we pass `spark.read.csv` an explicit schema built from the parsed schema file:

In [17]:
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, TimestampNTZType, FloatType, IntegerType, LongType

# Spark SQL type to use for each field of the machine_events schema
field_to_spark_type = {
    'time': TimestampNTZType,
    'machine ID': LongType,
    'event type': IntegerType,
    'platform ID': StringType,
    'CPUs': FloatType,
    'Memory': FloatType
}
# One StructField per schema entry; the third argument is `nullable`,
# so only non-mandatory fields may be null
machine_events_schema = StructType([
    StructField(field['content'], field_to_spark_type[field['content']](), not field['mandatory'])
    for field in schemas['machine_events']['fields']
])
machine_events = spark.read.csv("../data/machine_events/part-00000-of-00001.csv", schema = machine_events_schema)
# Show the first few rows
for elem in machine_events.take(5):
    print(elem)

Py4JJavaError: An error occurred while calling o81.csv.
: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:240)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:236)
	at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1890)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:197)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:192)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:73)
	at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:81)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
	at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:444)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
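
The `scala.MatchError` above is raised in `RowEncoder` for `TimestampNTZType`, which suggests that the Spark installation used here cannot encode that type. Since the schema file declares `time` as INTEGER, one possible way around the error is to read it as a 64-bit integer instead. `DataFrameReader.csv` also accepts the schema as a DDL-formatted string, so the sketch below builds such a string from the schema fields. It assumes that each field's `format` value is one of the plain strings shown in the schema table. The names `FORMAT_TO_DDL` and `ddl_from_fields` are hypothetical, and mapping every INTEGER field to `BIGINT` is a simplification of the per-field mapping used above.

```python
# Spark SQL type names for the 'format' values seen in the schema tables.
FORMAT_TO_DDL = {
    "INTEGER": "BIGINT",      # 64-bit, so large time values and IDs fit
    "FLOAT": "FLOAT",
    "STRING_HASH": "STRING",
}


def ddl_from_fields(fields):
    """Build a DDL schema string with one '`name` TYPE' entry per field."""
    columns = []
    for field in fields:
        # Backticks are needed because some column names contain spaces.
        columns.append(f"`{field['content']}` {FORMAT_TO_DDL[field['format']]}")
    return ", ".join(columns)


# Hypothetical stand-in for schemas['machine_events']['fields'].
example_fields = [
    {"content": "time", "format": "INTEGER"},
    {"content": "machine ID", "format": "INTEGER"},
    {"content": "event type", "format": "INTEGER"},
    {"content": "platform ID", "format": "STRING_HASH"},
    {"content": "CPUs", "format": "FLOAT"},
    {"content": "Memory", "format": "FLOAT"},
]

ddl = ddl_from_fields(example_fields)
print(ddl)
```

With a running session, the resulting string could be passed as `spark.read.csv(path, schema=ddl)`. Whether the integer `time` values should later be converted to a timestamp column is a separate decision that this sketch leaves open.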
