# Exploratory code

The code in this notebook collects the explorations and experiments that led to our final analysis.

## Parsing the schema file

In [None]:
import schema
from tabulate import tabulate
from importlib import reload
reload(schema)

schemas = schema.data_schemas_from_file("../data/schema.csv")

In [None]:
for key,value in schemas.items():
    print("#########################################################################")
    print(f"{key} ({value['file pattern']}):")
    print(tabulate(value['fields'],headers='keys'))
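
The `schema` module itself is not shown in this notebook. To give a rough idea of the structure printed above, here is a minimal sketch of such a parser. It assumes that the schema CSV has one row per field with the columns `file pattern`, `content`, `format` and `mandatory`, that `mandatory` is encoded as `YES`/`NO`, and that the table name is the first path component of the file pattern. The function `parse_schemas` and the sample rows are hypothetical; this is not the actual `schema.data_schemas_from_file`.

```python
import csv
import io

def parse_schemas(lines):
    """Group schema rows by table (hypothetical stand-in for the schema module)."""
    schemas = {}
    for row in csv.DictReader(lines):
        pattern = row["file pattern"]
        # Assumed convention: the table name is the first path component.
        key = pattern.split("/")[0]
        table = schemas.setdefault(key, {"file pattern": pattern, "fields": []})
        table["fields"].append({
            "content": row["content"],
            "format": row["format"],
            "mandatory": row["mandatory"] == "YES",
        })
    return schemas

# Made-up sample in the assumed layout, not the real schema file.
sample = io.StringIO(
    "file pattern,content,format,mandatory\n"
    "machine_events/part-?????-of-?????.csv,time,INTEGER,YES\n"
    "machine_events/part-?????-of-?????.csv,CPUs,FLOAT,NO\n"
)
example = parse_schemas(sample)
print(list(example))
```

Keeping the fields as a list of dictionaries is what lets `tabulate(..., headers='keys')` print them as a table.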

## Starting Spark

In [None]:
from pyspark import SparkContext

# start Spark locally, with as many worker threads as logical cores
sc = SparkContext("local[*]")
sc.setLogLevel("ERROR")

Alternate approach with SparkSQL, allowing the use of DataFrames:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Cluster analysis") \
    .getOrCreate()

## Opening and pre-processing machine events

This file describes events that occurred to the machines in the cluster, as well as their specifications (initial events).

First, let us see the schema of this file's data:

In [None]:
print(tabulate(schemas['machine_events']['fields'],headers='keys'))

### With DataFrames

It would be nice to have the tabular methods provided by DataFrames. Fortunately, Spark offers a DataFrame API through Spark SQL. To [load a CSV as a DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv), we first build a Spark schema from the parsed schema file and then pass it to the reader:

In [None]:
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType
field_to_spark_type = {
    'time': LongType,
    'machine ID': LongType,
    'event type': IntegerType,
    'platform ID': StringType,
    'CPUs': FloatType,
    'Memory': FloatType
}
# build one StructField per field of the parsed schema
machine_events_schema = StructType([
    StructField(
        field['content'].replace(' ', '_'),
        field_to_spark_type[field['content']](),
        field['mandatory']
    )
    for field in schemas['machine_events']['fields']
])

machine_events = spark.read \
    .format('csv') \
    .option("header","true") \
    .schema(machine_events_schema) \
    .load("../data/machine_events/part-00000-of-00001.csv")

Let us check that the schema is as we expect:

In [None]:
machine_events.printSchema()

And inspect the first few rows:

In [None]:
for elem in machine_events.take(5):
    print(elem.asDict())

How many events do we have in total, and how many of event type 1 and of event type 2?

In [None]:
(
    machine_events.count(),
    machine_events.filter(machine_events.event_type == 1).count(),
    machine_events.filter(machine_events.event_type == 2).count(),
)

In [None]:
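# attribute access only returns a Column object; select and show to see the values
machine_events.filter(machine_events.event_type == 0).select('CPUs').show()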

### Going back to RDD

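PySpark DataFrames have no `map` method of their own (one has to go through `.rdd` first), which makes them awkward for what we want to do next. We therefore reload the file as a plain RDD: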

In [None]:
machine_events_schema = schemas['machine_events']
machine_events = sc.textFile("../data/machine_events/part-00000-of-00001.csv").map(lambda row: schema.format_row(machine_events_schema, row.split(',')))
for elem in machine_events.take(5):
    print(elem)
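
`schema.format_row` is not shown here either. A minimal sketch of what such a helper could look like follows, assuming each field carries a `format` entry and that empty CSV cells become `None` (which would explain the `None` capacities filtered out further down). The name `format_row_sketch`, the `format` key, the type names and the sample row are all assumptions made for illustration.

```python
# Hypothetical converters, keyed by an assumed 'format' entry of each field.
CONVERTERS = {"INTEGER": int, "FLOAT": float, "STRING_HASH": str}

def format_row_sketch(table_schema, cells):
    """Convert raw CSV cells to typed values; empty cells become None."""
    typed = []
    for field, cell in zip(table_schema["fields"], cells):
        convert = CONVERTERS.get(field["format"], str)
        typed.append(convert(cell) if cell != "" else None)
    return typed

# Toy schema using the field names from this notebook, with assumed formats.
toy_schema = {"fields": [
    {"content": "time", "format": "INTEGER"},
    {"content": "machine ID", "format": "INTEGER"},
    {"content": "event type", "format": "INTEGER"},
    {"content": "platform ID", "format": "STRING_HASH"},
    {"content": "CPUs", "format": "FLOAT"},
    {"content": "Memory", "format": "FLOAT"},
]}

# Made-up row whose last cell (Memory) is empty.
typed_row = format_row_sketch(toy_schema, "0,5,0,abc,0.5,".split(","))
print(typed_row)
```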

In the following, we only consider machine creation events (event type 0). Note that this method can count a given machine more than once, if it was added more than once (and removed in between) or modified.
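
One possible way around this double counting is to keep a single creation event per machine ID. The sketch below shows the idea on plain Python tuples. The helper `first_add_per_machine`, the column order (time, machine ID, event type, platform ID, CPUs, Memory) and the toy rows are assumptions made for illustration, not part of our analysis.

```python
def first_add_per_machine(rows, time_idx=0, machine_idx=1, event_idx=2):
    """Keep the earliest creation event (event type 0) for each machine ID."""
    earliest = {}
    for row in rows:
        if row[event_idx] != 0:
            continue  # ignore everything that is not a creation event
        machine = row[machine_idx]
        if machine not in earliest or row[time_idx] < earliest[machine][time_idx]:
            earliest[machine] = row
    return list(earliest.values())

# Made-up rows: machine 5 is added, removed, then added again.
toy_rows = [
    (0, 5, 0, "abc", 0.5, 0.25),
    (10, 5, 1, "abc", None, None),
    (20, 5, 0, "abc", 0.5, 0.25),
    (0, 6, 0, "abc", 1.0, 0.5),
]
unique_adds = first_add_per_machine(toy_rows)
print(len(unique_adds))
```

On an RDD, the same idea could probably be expressed by keying each creation row by its machine ID and using `reduceByKey` to keep the row with the smallest timestamp.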

Distribution of machine capacity based on their CPU power:

In [None]:
from operator import add

cpu = schema.index_of_field(machine_events_schema, 'CPUs')
event = schema.index_of_field(machine_events_schema, 'event type')

cpu_usage = machine_events \
    .filter(lambda row: row[event] == 0) \
    .map(lambda row: (row[cpu], 1)) \
    .reduceByKey(add)

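# collect to the driver before printing, so the output shows up in the notebook
for capacity, n_machines in cpu_usage.collect():
    print(f"{n_machines} machines have CPU {capacity}")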
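
For readers less familiar with `reduceByKey`, the filter/map/reduce chain above boils down to a per-key count. Here is a plain-Python equivalent on a handful of made-up rows; the `sample_rows` values and the column positions are illustrative assumptions, not taken from the data set.

```python
from collections import Counter

# Made-up rows: (time, machine ID, event type, platform ID, CPUs, Memory)
sample_rows = [
    (0, 1, 0, "abc", 0.5, 0.25),
    (0, 2, 0, "abc", 0.5, 0.5),
    (0, 3, 0, "abc", 1.0, 1.0),
    (5, 1, 1, "abc", None, None),  # a removal, filtered out below
]
cpu_idx, event_idx = 4, 2

# filter on creation events, then count the occurrences of each CPU capacity
cpu_counts = Counter(row[cpu_idx] for row in sample_rows if row[event_idx] == 0)

for capacity, n_machines in cpu_counts.items():
    print(f"{n_machines} machines have CPU {capacity}")
```

The Spark version does the same thing, except that the partial counts are first computed per partition and then merged with `add`.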

In [None]:
import matplotlib.pyplot as plt

# drop the machines whose CPU capacity is missing
cpu_usage = cpu_usage.filter(lambda x: x[0] is not None)

# sort once and collect the (capacity, count) pairs together,
# so that the labels and the bar heights stay aligned
pairs = cpu_usage.sortByKey().collect()
legend = [str(capacity) for capacity, _ in pairs]
count = [n_machines for _, n_machines in pairs]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(legend,count)
plt.show()