### Create Dataset

When using a Domino on-demand Spark cluster, any data that will be used, created, or modified as part of the interaction must go into an external data store.

In this notebook, we copy the data file from the imported Git repository into a Domino Dataset and work with it there.

When you create a Spark cluster attached to a Domino workspace or job, any Domino dataset accessible from the workspace or job will also be accessible from all components of the cluster under the same dataset mount path. Data can be accessed using the following path prefix:
`file:///`

For example, to read a file you would use the following:
`rdd = sc.textFile("file:///path/to/file")`
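
As a minimal sketch of how the prefix combines with a dataset mount path, the hypothetical helper `to_file_uri` below (it is not part of Domino or Spark) turns an absolute path into a `file:///` URI. The project name in the usage line is a placeholder, and the helper does not percent-encode special characters such as spaces.

```python
import posixpath


def to_file_uri(path):
    """Return a file:// URI for an absolute POSIX path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    # normpath collapses duplicate slashes such as 'dir//file.csv'
    return "file://" + posixpath.normpath(path)


# 'my-project' is a placeholder project name
print(to_file_uri("/domino/datasets/local/my-project/wine_quality.csv"))
```

The resulting string can be passed wherever a path is expected, for example to `sc.textFile`.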


To read from other data sources, see the [Domino documentation on working with data](https://docs.dominodatalab.com/en/latest/user_guide/a3b42e/work-with-data/).

In [None]:
import os

dataset_path = '/domino/datasets/local/' + os.environ['DOMINO_PROJECT_NAME'] + '/'

In [None]:
# Set up the data location

# Change the following line if you are not using the spark-quickstart-winequality repo as a Domino imported repo
file_path = '/repos/spark-quickstart-winequality/wine_quality.csv'

!cp $file_path $dataset_path
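
The `!cp` line above relies on IPython's shell escape, so it only works inside a notebook. If the same step needs to run in a plain Python script or a Domino job, a sketch along the following lines could be used instead. The function name `copy_into_dataset` is made up for illustration.

```python
import os
import shutil


def copy_into_dataset(src_file, dataset_dir):
    """Copy src_file into dataset_dir and return the destination path."""
    # Create the target directory if it does not exist yet
    os.makedirs(dataset_dir, exist_ok=True)
    # When the destination is a directory, shutil.copy keeps the file name
    return shutil.copy(src_file, dataset_dir)
```

In this notebook it could be called as `copy_into_dataset(file_path, dataset_path)`.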

In [None]:
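# dataset_path already ends with '/', so join the parts instead of adding another slash
new_file_path = os.path.join(dataset_path, 'wine_quality.csv')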

In [None]:
#Setup Pyspark
from pyspark.ml import Pipeline
from pyspark.sql.functions import mean, col, split, regexp_extract, when, lit
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer

from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName("WineQualityApp") \
        .getOrCreate()

In [None]:
#Setup and Test Spark Context
sc = spark.sparkContext
sc

In [None]:
#Read Data
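# Read the copy in the Domino Dataset (new_file_path, set above) through the file:// prefix
red_wine = spark.read.format('csv').options(header='true', inferSchema='true', sep=';').load('file://' + new_file_path)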

In [None]:
#Profile Data
red_wine.printSchema()
print("Rows: %s" % red_wine.count())
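
The `sep=';'` option matters because the file is semicolon-delimited. The sketch below uses only the standard `csv` module and a made-up two-line sample (these are not rows from the real file) to show what happens when the delimiter is wrong: each line collapses into a single column.

```python
import csv
import io

# Made-up sample in a semicolon-delimited layout (not real data)
sample = "alcohol;pH;quality\n9.5;3.3;5\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


print(read_rows(sample, ","))  # wrong delimiter: one field per line
print(read_rows(sample, ";"))  # right delimiter: three fields per line
```

If `printSchema()` shows a single string column, a wrong separator is a likely cause.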

In [None]:
#Feature Extraction into Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# select the columns to be used as the features (all except `quality`)
featureColumns = [c for c in red_wine.columns if c != 'quality']

# create and configure the assembler
assembler = VectorAssembler(inputCols=featureColumns, 
                            outputCol="features")

# transform the original data
dataDF = assembler.transform(red_wine)
dataDF.printSchema()
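
Conceptually, the assembler packs the selected columns of each row into one vector. The plain-Python sketch below mimics that on a single made-up row. The `assemble` function is hypothetical and only illustrates the idea; it is not how `VectorAssembler` is implemented.

```python
def assemble(row, input_cols):
    """Collect the values of input_cols from a dict-like row into one list."""
    return [float(row[c]) for c in input_cols]


# Made-up values, for illustration only
row = {"alcohol": 9.5, "pH": 3.3, "quality": 5}

# Same selection rule as above: every column except the label
feature_cols = [c for c in row if c != "quality"]

print(assemble(row, feature_cols))
```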

In [None]:
dataDF.show()

In [None]:
from pyspark.ml.regression import LinearRegression

# fit a `LinearRegression` model using features in column `features` and label in column `quality`
lr = LinearRegression(maxIter=30, regParam=0.3, elasticNetParam=0.3, featuresCol="features", labelCol="quality")
lrModel = lr.fit(dataDF)
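
To give a feel for what `regParam` and `elasticNetParam` control, the sketch below computes the elastic net penalty as we understand it from the Spark documentation: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The function name and the example weights are made up, and the sketch ignores details such as feature standardization.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic net penalty: a mix of L1 and squared L2 norms of the weights."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) * 0.5 * l2_squared)


# Made-up weights; reg_param and elastic_net_param match the values used above
print(elastic_net_penalty([1.0, -2.0], reg_param=0.3, elastic_net_param=0.3))
```

With `elasticNetParam=0` the penalty is pure ridge (L2), and with `elasticNetParam=1` it is pure lasso (L1).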

In [None]:
for t in zip(featureColumns, lrModel.coefficients):
    print(t)

In [None]:
# predict the quality; the predicted quality is saved in the `prediction` column
predictionsDF = lrModel.transform(dataDF)
# show() prints the rows itself and returns None, so it does not need display()
predictionsDF.show()

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# create a regression evaluator with RMSE metrics

evaluator = RegressionEvaluator(
    labelCol='quality', predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictionsDF)
print("Root Mean Squared Error (RMSE) = %g" % rmse)
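
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The following plain-Python sketch computes it for small lists. It is only meant to make the metric concrete and is not a replacement for `RegressionEvaluator`.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and predictions
print(rmse([5, 6, 7], [5.5, 5.5, 7.0]))
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones.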

In [None]:
# split the input data into training and test dataframes with 70% and 30% weights
(trainingDF, testDF) = red_wine.randomSplit([0.7, 0.3])
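
Note that a weighted random split assigns each row at random, so the resulting sizes are only approximately 70/30, and they change between runs unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The sketch below illustrates the idea in plain Python. The `random_split` function is hypothetical and is not Spark's actual implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same data.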

In [None]:
from pyspark.ml import Pipeline

# construct the `Pipeline` with two stages: the `vector assembler` and the `regression model estimator`
pipeline = Pipeline(stages=[assembler, lr])

# train the pipeline on the training data
lrPipelineModel = pipeline.fit(trainingDF)

# make predictions
trainingPredictionsDF = lrPipelineModel.transform(trainingDF)
testPredictionsDF = lrPipelineModel.transform(testDF)

# evaluate the model on training and test data
print("RMSE on training data = %g" % evaluator.evaluate(trainingPredictionsDF))
print("RMSE on test data = %g" % evaluator.evaluate(testPredictionsDF))

In [None]:
#Run Random Forest
from pyspark.ml.regression import RandomForestRegressor

# define the random forest estimator
rf = RandomForestRegressor(featuresCol="features", labelCol="quality", numTrees=100, maxBins=128, maxDepth=20,
                           minInstancesPerNode=5, seed=33)
rfPipeline = Pipeline(stages=[assembler, rf])

# train the random forest model
rfPipelineModel = rfPipeline.fit(trainingDF)

In [None]:
#Evaluate Random Forest RMSE on training and test data
rfTrainingPredictions = rfPipelineModel.transform(trainingDF)
rfTestPredictions = rfPipelineModel.transform(testDF)
print("Random Forest RMSE on training data = %g" % evaluator.evaluate(rfTrainingPredictions))
print("Random Forest RMSE on test data = %g" % evaluator.evaluate(rfTestPredictions))

In [None]:
#stop Spark context
sc.stop()