In [20]:
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import os
import time

MASTER_URL = "spark://10.4.0.20:7077"

os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

conf = (SparkConf()
        .set("spark.eventLog.enabled", "true")
        .set("spark.driver.host", "10.4.0.20")
        .set("spark.history.fs.logDirectory", "/tmp/spark-events")
        .set("spark.app.name", "step2-datacollector-extraction")
        .set("spark.driver.memory", "6G")
        .set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0") # mongo connector
        .setMaster(MASTER_URL))

# Analysis 1 - DataCollector Data Gathering
## Raw-Spark solution

With a working Spark environment, we can finally query DataCollector to gather InterSCity's data. We first present the Raw-Spark solution and explain its key code. The parameters used are:
- `default_uri`: the URI of DataCollector's MongoDB instance (host, port and database)
- `data_collector_development`: the database name
- `sensor_values`: the MongoDB collection
- `27017`: the MongoDB port
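
As a minimal sketch, the parameters above combine into the input URI that the connector reads from. The values mirror the ones used in the cells further down; the helper name `mongo_input_uri` is hypothetical and only serves this illustration.

```python
def mongo_input_uri(host, port, database, collection):
    """Build a 'mongodb://host:port/database.collection' URI string."""
    return "mongodb://{0}:{1}/{2}.{3}".format(host, port, database, collection)


# Same host, port, database and collection as in the notebook cells.
uri = mongo_input_uri("10.4.0.20", 27017, "data_collector_development", "sensor_values")
print(uri)
```

The resulting string is what the notebook passes as `spark.mongodb.input.uri`.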

Then, we define a `pipeline` to be used in the MongoDB extraction. This parameter matters for heterogeneous DataCollector environments: if the DataCollector has stored several different capabilities, the pipeline filters out the ones you do not need, so that the query runs faster.
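
The cells below assemble the pipeline by string concatenation. A possible alternative, sketched here under the assumption that the connector accepts standard JSON for its `pipeline` option, is to build the stages as Python objects and serialize them with `json.dumps`, which avoids quoting mistakes. The function name `build_pipeline` is hypothetical.

```python
import json


def build_pipeline(capability, limit=None):
    """Return the aggregation pipeline as a JSON string.

    `capability` is matched against the 'capability' field; `limit`, when
    given, adds a $limit stage with that row count.
    """
    stages = [{"$match": {"capability": capability}}]
    if limit is not None:
        stages.append({"$limit": int(limit)})
    return json.dumps(stages)


# Scenario 0 in this notebook asks for 1% of the rows (334667).
print(build_pipeline("city_traffic", limit=334667))
```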

Next, we define the `schema` of the city traffic simulation data.

With all of this defined, we can finally extract the data, which is done through the `load` call. The simulation only generates data points that specify at which node a given car was at a given timestamp.

The DataFrame `df` contains the raw simulation data; you can inspect it if you want.

We also store the DataFrame as a Parquet file because, as you can see, up to that point we only apply **transformations** to it. Spark uses lazy evaluation, so without an **action** nothing runs; writing to a file is an action, so it triggers the Spark processing and lets us finally measure its performance.
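
The difference between a transformation and an action can be illustrated without a cluster. The sketch below is only a plain-Python analogy using a generator, not Spark code: building the generator plays the role of a transformation, and consuming it plays the role of an action. The `rename` function and the sample records are made up for the illustration.

```python
calls = []


def rename(record):
    """Stand-in for a transformation; logs that it actually ran."""
    calls.append(record)
    return {"U": record["nodeID"], "T0": record["tick"]}


records = [{"nodeID": 1, "tick": 10}, {"nodeID": 2, "tick": 11}]

# "Transformation": building the generator does not call rename() yet.
lazy = (rename(r) for r in records)
ran_before = len(calls)

# "Action": consuming the generator forces the work to happen.
result = list(lazy)
ran_after = len(calls)

print(ran_before, ran_after)
```

Timing only the first half of this sketch would measure almost nothing, which is why the notebook needs the Parquet write before stopping the clock.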

#### Scenario 0 - 1% of available data
#### Scenario 1 - 10% of available data
#### Scenario 2 - 50% of available data
#### Scenario 3 - 95% of available data
#### Scenario 4 - All available data

The collection has **33466742** entries in total, and the exported MongoDB file is about 12 GB.

In [21]:
processing_times = []

In [18]:
# spark.stop()
# sc.stop()

conf.set("spark.app.name", "analysis1-rawspark-scenario1")

<pyspark.conf.SparkConf at 0x7f4a8bd81198>

In [23]:
import time
from pyspark.sql.types import LongType, StringType, StructType

total = 33466742

# Row limits for scenarios 0-3: 1%, 10%, 50% and 95% of all entries
samples = [str(int(total * p)) for p in (0.01, 0.1, 0.5, 0.95)]

for idx, u in enumerate(samples):
    conf.set("spark.app.name", "analysis1-rawspark-scenario{0}".format(idx))
    t0 = time.time()  # the measurement includes SparkContext startup

    sc = SparkContext(conf=conf)
    spark = SparkSession(sc)

    capability = "city_traffic"
    default_uri = "mongodb://10.4.0.20:27017/data_collector_development"
    default_collection = "sensor_values"
    pipeline = "[{'$match': {'capability': '"+capability+"'}}, {'$limit': "+u+"}]"

    sch = (StructType()
        .add("nodeID", LongType())
        .add("tick", LongType())
        .add("uuid", StringType()))

    df = (spark
        .read
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("spark.mongodb.input.uri", "{0}.{1}".format(default_uri, default_collection))
        .option("pipeline", pipeline)
        .schema(sch)
        .load()
        .withColumnRenamed("nodeID", "U")
        .withColumnRenamed("tick", "T0"))

    (df
        .write
        .format("parquet")
        .mode("overwrite")
        .save("/tmp/dataprocessor-report/step2.parquet"))

    t1 = time.time()

    spark.stop()
    sc.stop()
    
    processing_times.append(t1-t0)

In [26]:
import time
from pyspark.sql.types import LongType, StringType, StructType

# Scenario 4: no $limit stage, so the whole collection is read
conf.set("spark.app.name", "analysis1-rawspark-scenario4")
t0 = time.time()  # the measurement includes SparkContext startup

sc = SparkContext(conf=conf)
spark = SparkSession(sc)

capability = "city_traffic"
default_uri = "mongodb://10.4.0.20:27017/data_collector_development"
default_collection = "sensor_values"
pipeline = "{'$match': {'capability': '"+capability+"'}}"

sch = (StructType()
    .add("nodeID", LongType())
    .add("tick", LongType())
    .add("uuid", StringType()))

df = (spark
    .read
    .format("com.mongodb.spark.sql.DefaultSource")
    .option("spark.mongodb.input.uri", "{0}.{1}".format(default_uri, default_collection))
    .option("pipeline", pipeline)
    .schema(sch)
    .load()
    .withColumnRenamed("nodeID", "U")
    .withColumnRenamed("tick", "T0"))

(df
    .write
    .format("parquet")
    .mode("overwrite")
    .save("/tmp/dataprocessor-report/step2.parquet"))

t1 = time.time()

spark.stop()
sc.stop()

processing_times.append(t1-t0)

In [27]:
processing_times

[170.0696005821228,
 176.61079216003418,
 176.28682684898376,
 180.43790364265442,
 176.3809745311737]

The five scenarios take about the same time (170 to 180 seconds), even though the amount of requested data varies from 1% to 100%. This suggests that the MongoDB connector does not apply the `$limit` stage as one would expect. Other operations will probably not behave this way, as we shall see.
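
Since `t0` is taken before the SparkContext is created, each measurement above also includes the startup cost. One way to check how much of the time is fixed cost would be to time each phase separately. The sketch below shows a small helper for that; the name `timed`, the `phase_times` dictionary and the `sleep` calls (placeholders for the real Spark work) are all assumptions made for the illustration.

```python
import time
from contextlib import contextmanager

phase_times = {}


@contextmanager
def timed(phase):
    """Store the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_times[phase] = time.perf_counter() - start


# Placeholder workloads: in the notebook these would be the SparkContext
# creation and the load + Parquet write, respectively.
with timed("startup"):
    time.sleep(0.01)
with timed("load_and_write"):
    time.sleep(0.02)

print(phase_times)
```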

In [29]:
processor_processing_times = []

## DataProcessor Solution

In [32]:
t0 = time.time()

import requests
DATAPROCESSOR_URL = "http://localhost:4000"
headers = {'Content-type': 'application/vnd.api+json'}

response = requests.post(DATAPROCESSOR_URL+"/api/processing_jobs/6/run", headers=headers)
print(response.text)

t1 = time.time()

processor_processing_times.append(t1-t0)

{"jsonapi":{"version":"1.0"},"included":[{"type":"job-script","id":"5","attributes":{"title":"Extract Collector","path":"collectorsource.py","language":"python","code-sample":"from pyspar\nfrom pyspar\nfrom pyspar\nimport requ\nimport os\nfrom pyspar\n\nimport sys\n\nif __name__\n    # Loadi","code":"from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DoubleType, IntegerType, DateType\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import explode, col\nimport requests\nimport os\nfrom pyspark import SparkContext, SparkConf\n\nimport sys\n\nif __name__ == '__main__':\n    # Loading the dataset\n    my_uuid = str(sys.argv[1])\n    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'\n    url = \"http://localhost:4000\" + '/api/job_templates/{0}'.format(my_uuid)\n    response = requests.get(url)\n    params = response.json()[\"data\"][\"attributes\"][\"user-params\"]\n\n    functional_params = params[\"functional\"]\n    capability = params[\"inter

In [34]:
t0 = time.time()

import requests
DATAPROCESSOR_URL = "http://localhost:4000"
headers = {'Content-type': 'application/vnd.api+json'}

response = requests.post(DATAPROCESSOR_URL+"/api/processing_jobs/7/run", headers=headers)
print(response.text)

t1 = time.time()

processor_processing_times.append(t1-t0)

{"jsonapi":{"version":"1.0"},"included":[{"type":"job-script","id":"5","attributes":{"title":"Extract Collector","path":"collectorsource.py","language":"python","code-sample":"from pyspar\nfrom pyspar\nfrom pyspar\nimport requ\nimport os\nfrom pyspar\n\nimport sys\n\nif __name__\n    # Loadi","code":"from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DoubleType, IntegerType, DateType\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import explode, col\nimport requests\nimport os\nfrom pyspark import SparkContext, SparkConf\n\nimport sys\n\nif __name__ == '__main__':\n    # Loading the dataset\n    my_uuid = str(sys.argv[1])\n    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'\n    url = \"http://localhost:4000\" + '/api/job_templates/{0}'.format(my_uuid)\n    response = requests.get(url)\n    params = response.json()[\"data\"][\"attributes\"][\"user-params\"]\n\n    functional_params = params[\"functional\"]\n    capability = params[\"inter

In [42]:
t0 = time.time()

import requests
DATAPROCESSOR_URL = "http://localhost:4000"
headers = {'Content-type': 'application/vnd.api+json'}

response = requests.post(DATAPROCESSOR_URL+"/api/processing_jobs/10/run", headers=headers)
print(response.text)

t1 = time.time()

processor_processing_times.append(t1-t0)

{"jsonapi":{"version":"1.0"},"included":[{"type":"job-script","id":"10","attributes":{"title":"Extract Collector","path":"collectorsource.py","language":"python","code-sample":"from pyspar\nfrom pyspar\nfrom pyspar\nimport requ\nimport os\nfrom pyspar\n\nimport sys\n\nif __name__\n    # Loadi","code":"from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DoubleType, IntegerType, DateType\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import explode, col\nimport requests\nimport os\nfrom pyspark import SparkContext, SparkConf\n\nimport sys\n\nif __name__ == '__main__':\n    # Loading the dataset\n    my_uuid = str(sys.argv[1])\n    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'\n    url = \"http://localhost:4000\" + '/api/job_templates/{0}'.format(my_uuid)\n    response = requests.get(url)\n    params = response.json()[\"data\"][\"attributes\"][\"user-params\"]\n\n    functional_params = params[\"functional\"]\n    capability = params[\"inte

In [45]:
processor_processing_times

[189.7351520061493, 190.39482831954956, 195.41603088378906]

In [46]:
t0 = time.time()

import requests
DATAPROCESSOR_URL = "http://localhost:4000"
headers = {'Content-type': 'application/vnd.api+json'}

response = requests.post(DATAPROCESSOR_URL+"/api/processing_jobs/11/run", headers=headers)
print(response.text)

t1 = time.time()

processor_processing_times.append(t1-t0)

{"jsonapi":{"version":"1.0"},"included":[{"type":"job-script","id":"11","attributes":{"title":"Extract Collector","path":"collectorsource.py","language":"python","code-sample":"from pyspar\nfrom pyspar\nfrom pyspar\nimport requ\nimport os\nfrom pyspar\n\nimport sys\n\nif __name__\n    # Loadi","code":"from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DoubleType, IntegerType, DateType\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import explode, col\nimport requests\nimport os\nfrom pyspark import SparkContext, SparkConf\n\nimport sys\n\nif __name__ == '__main__':\n    # Loading the dataset\n    my_uuid = str(sys.argv[1])\n    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'\n    url = \"http://localhost:4000\" + '/api/job_templates/{0}'.format(my_uuid)\n    response = requests.get(url)\n    params = response.json()[\"data\"][\"attributes\"][\"user-params\"]\n\n    functional_params = params[\"functional\"]\n    capability = params[\"inte

In [49]:
processor_processing_times

[189.7351520061493, 190.39482831954956, 195.41603088378906, 191.66101503372192]

In [50]:
t0 = time.time()

import requests
DATAPROCESSOR_URL = "http://localhost:4000"
headers = {'Content-type': 'application/vnd.api+json'}

response = requests.post(DATAPROCESSOR_URL+"/api/processing_jobs/12/run", headers=headers)
print(response.text)

t1 = time.time()

processor_processing_times.append(t1-t0)

{"jsonapi":{"version":"1.0"},"included":[{"type":"job-script","id":"12","attributes":{"title":"Extract Collector","path":"collectorsource.py","language":"python","code-sample":"from pyspar\nfrom pyspar\nfrom pyspar\nimport requ\nimport os\nfrom pyspar\n\nimport sys\n\nif __name__\n    # Loadi","code":"from pyspark.sql.types import StructType, StructField, ArrayType, StringType, DoubleType, IntegerType, DateType\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import explode, col\nimport requests\nimport os\nfrom pyspark import SparkContext, SparkConf\n\nimport sys\n\nif __name__ == '__main__':\n    # Loading the dataset\n    my_uuid = str(sys.argv[1])\n    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'\n    url = \"http://localhost:4000\" + '/api/job_templates/{0}'.format(my_uuid)\n    response = requests.get(url)\n    params = response.json()[\"data\"][\"attributes\"][\"user-params\"]\n\n    functional_params = params[\"functional\"]\n    capability = params[\"inte

In [52]:
processor_processing_times

[189.7351520061493,
 190.39482831954956,
 195.41603088378906,
 191.66101503372192,
 198.37721586227417]

In [53]:
processing_times

[170.0696005821228,
 176.61079216003418,
 176.28682684898376,
 180.43790364265442,
 176.3809745311737]

The Spark application names and processing times of each run are the following:

|solution |scenario |appname |processing_time|n (approx.)| n (total %)| n (total rows) |
|-----|------|------|-------|--------|----|----|
|raw-spark|0|app-20190424172036-0003|170.069s|~3.3e5|1%|334667|
|raw-spark|1|app-20190424172327-0004|176.610s|~3.3e6|10%|3346674|
|raw-spark|2|app-20190424172623-0005|176.286s|~1.7e7|50%|16733371|
|raw-spark|3|app-20190424172920-0006|180.437s|~3.2e7|95%|31793404|
|raw-spark|4|app-20190424173428-0007|176.380s|~3.3e7|100%|33466742|
|dataprocessor|0|app-20190424180608-0010|189.735s|~3.3e5|1%|334667|
|dataprocessor|1|app-20190424181149-0011|190.394s|~3.3e6|10%|3346674|
|dataprocessor|2|app-20190424182910-0014|195.416s|~1.7e7|50%|16733371|
|dataprocessor|3|app-20190424183358-0015|191.661s|~3.2e7|95%|31793404|
|dataprocessor|4|app-20190424184747-0016|198.377s|~3.3e7|100%|33466742|
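
To make the comparison easier to read, the sketch below computes the DataProcessor overhead per scenario from the processing times in the table. The variable names are chosen for this illustration only.

```python
# Processing times in seconds, copied from the table (scenarios 0 to 4).
raw_spark = [170.069, 176.610, 176.286, 180.437, 176.380]
dataprocessor = [189.735, 190.394, 195.416, 191.661, 198.377]

overheads = []
for scenario, (raw, proc) in enumerate(zip(raw_spark, dataprocessor)):
    diff = proc - raw            # absolute overhead in seconds
    pct = 100.0 * diff / raw     # overhead relative to raw Spark
    overheads.append((scenario, round(diff, 3), round(pct, 1)))
    print("scenario {0}: +{1:.3f}s ({2:.1f}%)".format(scenario, diff, pct))
```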


# Wrapping Up

DataProcessor is consistently slower than raw Spark, by a small margin (roughly 11 to 22 seconds, or about 6% to 12%, in the runs above) that should become less significant with larger volumes of data. However, these results revealed a new overhead: translating data from bash (the log output of SparkSubmit) to Elixir and then adding this log to the HTTP response may add a significant delay (7 seconds is relevant for a web request).