<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Boston House Price and </b> <span style="font-weight:bold; color:green">Spark MLlib</span></div><hr>
<div style="text-align:right;">Sergei Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Initial dataset</a></li>
        <li><a href="#2">Regression and cross-validation</a></li>
        <li><a href="#3">References</a></li>
    </ol>
</div>

<p>[OPTIONAL] <b>Environment Setup</b></p>

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/usr/lib/spark"
os.environ["PYSPARK_PYTHON"]="/opt/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/opt/anaconda3/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))

<p>Run Spark</p>

In [None]:
import pyspark
from pyspark.sql import SparkSession

In [None]:
conf = pyspark.SparkConf() \
        .setAppName("bostonApp") \
        .setMaster("yarn") \
        .set("spark.submit.deployMode", "client")

If you run **locally**:

In [None]:
conf = pyspark.SparkConf() \
        .setAppName("bostonApp") \
        .setMaster("local[2]") 

Create a Spark Session:

In [None]:
spark = SparkSession \
    .builder \
    .appName("bostonApp") \
    .config(conf=conf) \
    .getOrCreate()

In [None]:
spark

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Initial dataset</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
from pyspark.sql.types import StructType, StructField, DoubleType

<p>[IF NEEDED] Copy local input dataset to HDFS</p>

In [None]:
!hdfs dfs -copyFromLocal data/price-regression-cv-data/boston-house-price.csv data/spark_dataframe

In [None]:
!hdfs dfs -ls data/spark_dataframe

<p>Define a dataset schema</p>

In [None]:
dataset_path = "/YOUR_PATH/data/price-regression-cv-data/boston-house-price.csv"

schema_house = StructType(
    [StructField("CRIM", DoubleType(), True), # per capita crime rate by town
     StructField("ZN", DoubleType(), True), # proportion of residential land zoned for lots over 25,000 sq.ft.
     StructField("INDUS", DoubleType(), True), # proportion of non-retail business acres per town
     StructField("CHAS", DoubleType(), True), # Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
     StructField("NOX", DoubleType(), True), # nitric oxides concentration (parts per 10 million)
     StructField("RM", DoubleType(), True), # average number of rooms per dwelling
     StructField("AGE", DoubleType(), True), # proportion of owner-occupied units built prior to 1940
     StructField("DIS", DoubleType(), True), # weighted distances to five Boston employment centres
     StructField("RAD", DoubleType(), True), # index of accessibility to radial highways
     StructField("TAX", DoubleType(), True), # full-value property-tax rate per $10,000
     StructField("PTRATIO", DoubleType(), True), # pupil-teacher ratio by town
     StructField("B", DoubleType(), True), # 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
     StructField("LSTAT", DoubleType(), True), # % lower status of the population
     StructField("MEDV", DoubleType(), True)]) # Median value of owner-occupied homes in $1000’s

<p>Create a Spark dataframe</p>

In [None]:
df_house = spark.read.load(path=dataset_path,
                           format="csv", 
                           schema=schema_house,
                           header="false", 
                           inferSchema="false", 
                           sep=",")
df_house.persist()
df_house.show(5)

<p>Display the number of rows</p>

In [None]:
df_house.count()

<p>Calculate data statistics</p>

In [None]:
df_house_stats = df_house.describe()
df_house_stats.show()

<p>Display formatted output using Pandas</p>

In [None]:
df_house_stats.toPandas().transpose()

<p><b>Draw plots for initial dataset</b></p>

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

<p>Sample data from Spark dataframe</p>

In [None]:
pd_df_house_sample = df_house.sample(False, 0.2, seed=123).toPandas()
pd_df_house_sample.head(5)

<p>Draw plots</p>

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(pd_df_house_sample, figsize=(15, 15))
plt.show()

In [None]:
plt.figure(2, figsize=[10,10])
plt.matshow(pd_df_house_sample.corr(), vmin=-1, vmax=1, fignum=2)
plt.title("Correlation")
plt.xticks(range(len(pd_df_house_sample.columns)), pd_df_house_sample.columns)
plt.yticks(range(len(pd_df_house_sample.columns)), pd_df_house_sample.columns)
plt.colorbar(fraction=0.046, pad=0.04)
plt.show()

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Regression and cross-validation</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

Assemble the feature columns into a single vector column

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
feature_columns = ["RM", "LSTAT"]

In [None]:
feature_assembler = VectorAssembler(inputCols=feature_columns,
                                    outputCol="Features")

df_house_with_features = feature_assembler.transform(df_house)
df_house_with_features.show(5)

In [None]:
df_features = df_house_with_features.select("Features", "MEDV")
df_features.show(5)

<p>Split the data into training and test subsets</p>

In [None]:
df_train, df_test = df_features.randomSplit([0.8, 0.2], seed=123)
df_train.count(), df_test.count()

In [None]:
df_train.show(5)

<p>Initialize a linear model</p>

<i>solver = {"l-bfgs", "normal", "auto"}</i>

In [None]:
reg_lr = LinearRegression(maxIter=100, solver="l-bfgs", featuresCol="Features", labelCol="MEDV")

<p>Cross-Validation</p>

<i>metric = {"rmse", "mse", "r2", "mae"}</i>

In [None]:
evaluator = RegressionEvaluator(metricName="rmse", predictionCol="prediction", labelCol="MEDV")
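
The metric names listed above can be read through their textbook definitions. The following is a minimal pure-Python sketch, not Spark's implementation; the helper `regression_metrics` and the sample values are hypothetical and serve only as an illustration.

```python
import math


def regression_metrics(labels, predictions):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - (mse * n) / ss_total
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical MEDV labels and predictions (in $1000's), not taken from the dataset
metrics = regression_metrics([20.0, 30.0, 40.0], [22.0, 30.0, 38.0])
print(metrics)
```

Lower values are better for `rmse`, `mse` and `mae`, while higher values are better for `r2`. `RegressionEvaluator.isLargerBetter()` reports this direction, so the cross-validator knows whether to minimise or maximise the chosen metric.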

<p>Create a grid of hyperparameters</p>

In [None]:
grid = ParamGridBuilder()\
          .addGrid(reg_lr.regParam, [0.1, 0.01]) \
          .addGrid(reg_lr.fitIntercept, [False, True]) \
          .addGrid(reg_lr.elasticNetParam, [0.0, 0.5, 1.0]) \
          .build()
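
To get a feel for the size of this grid, the sketch below enumerates the same candidate values with `itertools.product`. It only illustrates the Cartesian product that the grid represents: the plain dictionaries and the variable names are assumptions, and the order of the maps returned by `ParamGridBuilder.build()` may differ.

```python
from itertools import product

# Same candidate values as in the grid above
reg_params = [0.1, 0.01]
fit_intercepts = [False, True]
elastic_net_params = [0.0, 0.5, 1.0]  # 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso)

# One dictionary per combination of hyperparameter values
combinations = [
    {"regParam": r, "fitIntercept": f, "elasticNetParam": e}
    for r, f, e in product(reg_params, fit_intercepts, elastic_net_params)
]

num_folds = 4  # same value as numFolds in the cross-validator
# Each combination is fitted once per fold; the best one is then refitted
# on the whole training set, which adds one more fit.
total_fits = len(combinations) * num_folds + 1

print(len(combinations), total_fits)
```

With 2 x 2 x 3 values the grid holds 12 combinations, so the training cost of cross-validation grows quickly as more values are added.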

<p>Initialize a cross-validator</p>

In [None]:
cv = CrossValidator(estimator=reg_lr, numFolds=4, estimatorParamMaps=grid, evaluator=evaluator)
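
The idea behind `numFolds=4` can be sketched in plain Python: the rows are divided into four folds, and each fold serves once as the validation set while the remaining three are used for training. This is a simplified illustration, not Spark's implementation, which assigns rows to folds in its own way; the function `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(num_rows, num_folds, seed=123):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every num_folds-th shuffled index, starting at position i
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(num_rows=10, num_folds=4))
for train, validation in splits:
    print(len(train), len(validation))
```

Every row ends up in exactly one validation fold, so each candidate model is scored on data it was not trained on, and the scores are averaged over the folds.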

<p><b>Model selection</b></p>

<p>Run cross-validation (train the models)</p>

In [None]:
m_cv = cv.fit(df_train)
m_cv

<p>Display a list of output metrics for all combinations of hyperparameters</p>

In [None]:
m_cv.avgMetrics
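
`avgMetrics` holds one averaged metric per parameter map, in the same order as `estimatorParamMaps`, so the two lists can be zipped to see which combination produced which score. The helper below is a hypothetical sketch with made-up numbers and simplified dictionaries in place of Spark's parameter maps.

```python
def rank_param_maps(avg_metrics, param_maps, larger_is_better=False):
    """Pair each averaged metric with its parameter map, best first."""
    pairs = list(zip(avg_metrics, param_maps))
    return sorted(pairs, key=lambda pair: pair[0], reverse=larger_is_better)


# Hypothetical averaged RMSE values and parameter maps, for illustration only
example_metrics = [5.6, 5.4, 5.9]
example_maps = [
    {"regParam": 0.1, "elasticNetParam": 0.0},
    {"regParam": 0.01, "elasticNetParam": 0.0},
    {"regParam": 0.1, "elasticNetParam": 1.0},
]

ranked = rank_param_maps(example_metrics, example_maps)
for metric, params in ranked:
    print(metric, params)
```

In this notebook the call would presumably be `rank_param_maps(m_cv.avgMetrics, grid, evaluator.isLargerBetter())`. The keys of Spark's parameter maps are `Param` objects rather than strings, so printing `param.name` for each key gives a more readable result.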

<p>Display the parameter map of the fitted cross-validation model</p>

In [None]:
m_cv.extractParamMap()

<p>Get the best model</p>

In [None]:
best_m_lr = m_cv.bestModel
best_m_lr

<p>Display the best model coefficients</p>

In [None]:
best_m_lr.coefficients

In [None]:
best_m_lr.intercept

<p>Interpretation of the best model</p>

In [None]:
f_pred = lambda x1, x2: best_m_lr.intercept + best_m_lr.coefficients[0] * x1 + best_m_lr.coefficients[1] * x2

In [None]:
f_pred(6, 10)

<p>Training summary</p>

In [None]:
best_m_lr.summary.objectiveHistory

In [None]:
best_m_lr.summary.rootMeanSquaredError

In [None]:
best_m_lr.summary.totalIterations

<p><b>Test the best model</b></p>

In [None]:
df_test_pred = m_cv.transform(df_test)
df_test_pred.show(5)

<p>Set R^2 metric</p>

In [None]:
evaluator.setParams(metricName="r2")

<p>Result</p>

In [None]:
evaluator.evaluate(df_test_pred)

<p>Get a sample to draw plots</p>

In [None]:
pd_df_house_sample = df_house.select("RM", "LSTAT", "MEDV").sample(False, 0.2, seed=123).toPandas()
pd_df_house_sample.head(5)

<p>Draw plots with initial and predicted values</p>

In [None]:
plt.figure(2, figsize=[10,5])

plt.subplot(1,2,1)

plt.plot(pd_df_house_sample["RM"], 
         pd_df_house_sample["MEDV"], 
         "bo",
         label="initial")
plt.plot(pd_df_house_sample["RM"], 
         f_pred(pd_df_house_sample["RM"], pd_df_house_sample["LSTAT"]), 
         "ro", 
         label="predicted")
plt.axis([3, 10, 0, 55])
plt.title("RM-MEDV")
plt.xlabel("RM")
plt.ylabel("MEDV, $1000’s")
plt.legend()
plt.grid(True)

plt.subplot(1,2,2)

plt.plot(pd_df_house_sample["LSTAT"], 
         pd_df_house_sample["MEDV"], 
         "bo",
         label="initial")
plt.plot(pd_df_house_sample["LSTAT"], 
         f_pred(pd_df_house_sample["RM"], pd_df_house_sample["LSTAT"]), 
         "ro", 
         label="predicted")
plt.axis([0, 40, 0, 55])  # LSTAT covers a wider range than RM
plt.title("LSTAT-MEDV")
plt.xlabel("LSTAT")
plt.ylabel("MEDV, $1000’s")
plt.legend()
plt.grid(True)

<p>Stop the Spark session</p>

In [None]:
spark.stop()

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>