Universität Heidelberg
Distributed Systems I (IVS1)
Winter Semester 18/19

- Duc Anh Phi
- Michael Tabachnik
- Edgar Brotzmann


# Solutions to Problem Set 5 for lecture Distributed Systems I (IVS1)
## Due: 27.11.2018, 2pm

### Exercise 1

#### a)

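##### What is the vector data type used for?

The vector data type represents the numeric feature values of a data point as a local vector, i.e. one stored on a single machine. MLlib algorithms take such vectors as their input (here, the `features` column).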

##### What is the difference between the representations used for sparse and dense vectors?

A dense vector is backed by a double array holding every entry, while a sparse vector is backed by two parallel arrays, indices and values, that describe only the nonzero entries. A sparse vector is preferred over a dense vector if the represented data has many zero entries, since skipping those entries saves memory.
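
The two layouts can be sketched in plain Python. This is only an illustration of the storage idea, not Spark's own classes; the helper names `to_sparse` and `to_dense` are made up for this example. In PySpark the corresponding constructors are `Vectors.dense` and `Vectors.sparse` from `pyspark.ml.linalg`.

```python
# Plain-Python illustration of dense vs. sparse storage (hypothetical helpers).

def to_sparse(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of entries from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]
size, indices, values = to_sparse(dense)
print(size, indices, values)  # 8 [0, 3] [1.0, 3.0]
```

Here the dense form stores eight numbers, while the sparse form stores the size plus two indices and two values. The saving grows with the share of zero entries.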

#### b)

##### Come up with 2 new features - why might such new features be relevant for the prediction?

Number of rooms per person: We assume that the more rooms an inhabitant can afford, the higher the house value is likely to be.

Number of people per household: How densely a household is occupied might be a good indicator for the prediction.


#### c)

##### Run code with newly selected features.

In [2]:
import findspark

# Point findspark at the local Spark installation before importing pyspark
findspark.init("/usr/local/spark")

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import FloatType
from pyspark.sql.functions import col
from pyspark.ml.linalg import DenseVector
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

In [3]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Solutions to Problem Set 5 IVS1") \
    .config("spark.executor.memory", "1gb") \
    .getOrCreate()

sc = spark.sparkContext

In [4]:
def createDataframe():
    # Each line of `cal_housing.data` holds nine comma-separated values;
    # they are read as strings here and cast to floats afterwards
    rdd = sc.textFile('cal_housing.data')
    df = rdd \
        .map(lambda line: line.split(',')) \
        .map(lambda fields: Row(
            longitude=fields[0],
            latitude=fields[1],
            housingMedianAge=fields[2],
            totalRooms=fields[3],
            totalBedRooms=fields[4],
            population=fields[5],
            households=fields[6],
            medianIncome=fields[7],
            medianHouseValue=fields[8])) \
        .toDF()
    return df

def convertColumns(df, names, newType):
    for name in names:
        df = df.withColumn(name, df[name].cast(newType))
    return df

columns = [
    'households',
    'housingMedianAge',
    'latitude',
    'longitude',
    'medianHouseValue',
    'medianIncome',
    'population',
    'totalBedRooms',
    'totalRooms'
]

df = createDataframe()
df = convertColumns(df, columns, FloatType())

In [5]:
# Adjust the values of `medianHouseValue`
df = df.withColumn("medianHouseValue", col("medianHouseValue")/100000)

# Add the new columns to `df`
df = df \
    .withColumn("roomsPerPerson", col("totalRooms")/col("population")) \
    .withColumn("personPerHousehold", col("population")/col("households"))

# Re-order and select columns
df = df.select(
    "medianHouseValue", 
    "totalBedRooms", 
    "population", 
    "households", 
    "medianIncome", 
    "personPerHousehold", 
    "roomsPerPerson"
)

In [6]:
# Define the `input_data` 
input_data = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))

# Replace `df` with the new DataFrame
df = spark.createDataFrame(input_data, ["label", "features"])

In [7]:
# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# Fit the DataFrame to the scaler
scaler = standardScaler.fit(df)

# Transform the data in `df` with the scaler
scaled_df = scaler.transform(df)

In [8]:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.8,.2],seed=1234)

In [9]:
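# Initialize `lr`
# Note: `featuresCol` defaults to "features", so the model is fit on the unscaled
# vectors; pass featuresCol="features_scaled" to train on the scaler's output instead
lr = LinearRegression(labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the data to the model
linearModel = lr.fit(train_data)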

In [10]:
# Generate predictions
predicted = linearModel.transform(test_data)

# Extract the predictions and the "known" correct labels
predictions = predicted.select("prediction").rdd.map(lambda x: x[0])
labels = predicted.select("label").rdd.map(lambda x: x[0])

# Zip `predictions` and `labels` into a list
predictionAndLabel = predictions.zip(labels).collect()

# Print out first 5 instances of `predictionAndLabel` 
predictionAndLabel[:5]

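# Get the RMSE (`summary` is the training summary, so this is measured on `train_data`)
print('RMSE: {}'.format(linearModel.summary.rootMeanSquaredError))

# Get the R2 (also measured on `train_data`)
print('R2: {}'.format(linearModel.summary.r2))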

# Get the coefficients
print('Coefficients: {}'.format(linearModel.coefficients))


RMSE: 0.8764169349058016
R2: 0.42297586848753665
Coefficients: [0.0,0.0,0.0,0.27982347885597,0.0,0.0]


##### What features have most influence on the prediction?
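The median income had by far the most influence on the prediction. Looking at the coefficients, the model effectively used only this feature: its coefficient is about 0.28, whereas all other coefficients are exactly zero. This is consistent with the L1 part of the elastic-net penalty (`regParam=0.3`, `elasticNetParam=0.8`), which shrinks weak coefficients all the way to zero.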
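
Why most coefficients end up at exactly zero can be illustrated with the soft-thresholding operator, which is how an L1 penalty acts in the simplest textbook setting (standardized, uncorrelated features). The sketch below is not Spark's solver, and the unregularized coefficients are hypothetical values chosen for illustration. The threshold `0.3 * 0.8` is the L1 weight implied by `regParam` and `elasticNetParam` in the model above.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; anything with |z| <= t becomes exactly 0.0."""
    shrunk = max(abs(z) - t, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if z > 0 else -shrunk


l1_weight = 0.3 * 0.8  # regParam * elasticNetParam

# Hypothetical unregularized coefficients, NOT values from the model above
unregularized = [0.05, -0.10, 0.02, 0.60]

regularized = [soft_threshold(z, l1_weight) for z in unregularized]
print([round(c, 2) for c in regularized])  # [0.0, 0.0, 0.0, 0.36]
```

Only the coefficient whose magnitude exceeds the threshold survives, and even that one is shrunk.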

##### How can one measure the quality of a linear regression model prediction?
One can report the root mean squared error (RMSE), the square root of the mean squared difference between the predicted and the known values. The smaller the RMSE, the better the model's predictions.

Additionally, one can report the coefficient of determination (R2), the proportion of the variance in the label that the model explains. The closer R2 is to 1, the better the model fits the data.
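
Both metrics can be computed by hand, as in the following minimal sketch. The function names and the four sample values are made up for illustration. In the code above, Spark's model summary reports the same quantities.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions, chosen so the numbers are easy to check
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.0, 3.0, 6.0]

print(rmse(labels, predictions))          # 1.0
print(round(r2(labels, predictions), 3))  # 0.2
```

A single large error (6 instead of 4) dominates both metrics, because the residuals are squared before they are averaged.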

#### d)
##### Modify the current implementation to use Spark pipelines.

In [14]:
from pyspark.ml import Pipeline

standScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
linearReg = LinearRegression(labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[standScaler, linearReg])

model = pipeline.fit(df)

# Generate predictions
predicted = model.transform(df)

DataFrame[label: double, features: vector, features_scaled: vector, prediction: double]


#### e)

##### Take a look at the example code on how to use CrossValidation and incorporate it into your code.

In [17]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

paramGrid = ParamGridBuilder() \
    .addGrid(linearReg.regParam, [0.3, 0.1, 0.01]) \
    .addGrid(linearReg.fitIntercept, [False, True]) \
    .addGrid(linearReg.elasticNetParam, [0.0, 0.5, 0.8, 1.0]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=2)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(df)
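
To get a feeling for the cost of this search, the grid can be enumerated in plain Python. The sketch below only mirrors the values passed to `ParamGridBuilder` above and does not call Spark; the variable names are chosen for this example.

```python
from itertools import product

# Same candidate values as in the parameter grid above
grid = {
    "regParam": [0.3, 0.1, 0.01],
    "fitIntercept": [False, True],
    "elasticNetParam": [0.0, 0.5, 0.8, 1.0],
}
num_folds = 2

# Cartesian product of all candidate values: one dict per configuration
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

# Every configuration is fitted once per fold during the search
fits_during_search = len(combinations) * num_folds
print(len(combinations), fits_during_search)  # 24 48
```

With 3 x 2 x 4 = 24 configurations and 2 folds, the pipeline is fitted 48 times during the search, plus one final fit of the best configuration on the full dataset.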

#### f)

##### Run your model with the cross-validation algorithm. What is the best model configuration?

In [28]:
bestModel = cvModel.bestModel.stages[-1]

# Get the RMSE
print('RMSE: {}'.format(bestModel.summary.rootMeanSquaredError))

# Get the R2
print('R2: {}'.format(bestModel.summary.r2))

# Get the coefficients
print('Coefficients: {}'.format(bestModel.coefficients))

# Performance after tuning
# RMSE: 0.8058348969098945
# R2: 0.5123204275213011
# Coefficients: [
#    -4.9584499901162605e-05,
#    -0.0004184279464709144,
#    0.0013391432868942483,
#    0.41127157591553165,
#    0.0004510329315685896,
#    0.00460736475294319
#]

# Performance before tuning
# RMSE: 0.8764169349058016
# R2: 0.42297586848753665
# Coefficients: [0.0,0.0,0.0,0.27982347885597,0.0,0.0]

# As you can see, the model performs better after tuning.
# It takes every input feature into account instead of ignoring
# all except for the median income feature.
# This results in a lower RMSE and a higher R2 value for the tuned model
# (both taken from the training summaries).


RMSE: 0.8058348969098945
R2: 0.5123204275213011
Coefficients: [-4.9584499901162605e-05,-0.0004184279464709144,0.0013391432868942483,0.41127157591553165,0.0004510329315685896,0.00460736475294319]
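
One way to read off the winning configuration is to pair each parameter map with its average cross-validation metric and pick the best pair. The helper below is a hypothetical sketch in plain Python, and the sample values are stand-ins, not results from the run above. With PySpark it could presumably be called as `best_param_map(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)`, assuming the default `RegressionEvaluator` metric (RMSE, where smaller is better).

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=False):
    """Return the (param_map, metric) pair with the best average metric."""
    pairs = list(zip(param_maps, avg_metrics))
    chooser = max if larger_is_better else min
    return chooser(pairs, key=lambda pair: pair[1])


# Hypothetical stand-in values, NOT results from the cross-validation above
param_maps = [{"regParam": 0.3}, {"regParam": 0.1}, {"regParam": 0.01}]
avg_metrics = [0.90, 0.85, 0.82]

print(best_param_map(param_maps, avg_metrics))  # ({'regParam': 0.01}, 0.82)
```

For a metric such as R2, where larger values are better, `larger_is_better=True` selects the maximum instead.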


### Exercise 2

#### a) What strengths and weaknesses do SOAP/WSDL-based web services have?

#### b) What is your own opinion with regards to strengths and weaknesses of RESTful Web Services?

#### c) Which decisions do software architects / developers have to make in the case of SOAP/WSDL-based web services and which ones in the case of RESTful web services?

### Exercise 3

#### A realtime online multiplayer game:
- realtime data transfer
- custom application layer protocol on top of TCP/IP

#### Incorporate current exchange rates for international currencies in a desktop calculator app

#### A cross-platform distributed file system

#### A gaming platform where human and AI contestants are supposed to compete in turn-based games (like chess) that are ruled and recorded by a central server

#### A service that replies to messages including a bitmap image with a set of labels that apply for the content of the image

#### A distributed computing platform that runs on a diverse grid of machines

### Exercise 4

Why simple solutions tend to persist in industry when complex approaches seem to dominate academia