# Spark Machine learning - Spark-MLlib

As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalizations, recommendations, and predictive insights. The Apache Spark machine learning library (MLlib) allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, and so on).

From the inception of the Apache Spark project, MLlib was considered foundational for Spark’s success. Data engineers can focus on distributed systems engineering using Spark’s easy-to-use APIs, while data scientists can leverage the scale and speed of Spark core. Just as important, Spark MLlib is a general-purpose library, providing algorithms for most use cases while at the same time allowing the community to build upon and extend it for specialized use cases.

This week we will look at the basics of some of the Spark MLlib packages; next week we will build these into machine learning pipelines.

We will start with a Databricks sample dataset that compares city population to median sale prices of homes.

In [0]:
import pyspark
from pyspark.sql.functions import col
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data.count()

Take a look at the data schema.

In [0]:
data.printSchema()
data.take(10)

Here we clean the data by dropping any rows with missing values.

In [0]:
data = data.dropna() # drop rows with missing values
data.count()

Next we select the columns we will use to predict house prices (features) and the correct output (labels). Below we use the Databricks `display` function, which can be useful for examining our DataFrames.

In [0]:
dataFeat = data.select(["2014 Population estimate","2015 median sales price"])
display(dataFeat)

Finally we need to convert the data to the format expected by Spark: a vector of features as a single column and the label as an additional column.

We define a function that does this (which we can use later). As you can see, it takes our DataFrame and converts it to an RDD. This then allows us to apply the map transformation to each entry of the RDD. Here we take all the columns except the last (in our case only one) and convert these to a dense vector; we keep the last column separate for the label.

Vectors (dense or sparse) are a way of holding data that is used a lot in machine learning. In the features column below we only have one data item (the size of the population), but as you can see the data structure is as follows: [1,1,[],[188226]]. This format makes more sense when we use sparse vectors: the data is easier to read, but mostly it is more space efficient. For more information, this video has a short explanation of the vector format: https://www.youtube.com/watch?v=oGwEv82ifrE
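As a rough illustration of the two layouts, the plain-Python sketch below converts between a dense list and the (size, indices, values) triple that a sparse vector stores. The helper names `sparse_to_dense` and `dense_to_sparse` are made up for this example and are not part of Spark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries, remembering where they were."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


# Only the two non-zero entries are stored in the sparse form.
as_dense = sparse_to_dense(3, [0, 2], [1.0, 3.0])
as_sparse = dense_to_sparse([1.0, 0.0, 3.0])
```

The space saving only matters when most entries are zero; for a single populated feature, as in our data, the dense form is just as compact.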

Here is the same vector in three formats:
```python
import numpy as np
from pyspark.ml.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector: size 3, with non-zero entries at indices 0 and 2.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
```
You will find that most of the machine learning algorithms in Spark are based on the features and label format. That is to say, you can use any of the machine learning algorithms in Spark after you prepare your data in this format, with a features column and a label column.

The function defined below will take a row of our DataFrame and convert all of the features into a dense vector. The final data format is a DataFrame with two columns, a label column and a features column.
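To make the per-row step concrete, here is a minimal plain-Python sketch of the split that is applied to each row: everything except the last column becomes the features, and the last column becomes the label. The function name `split_features_label` and the price value in the example row are assumptions for illustration only.

```python
def split_features_label(row):
    """Return (features, label): every column but the last, then the last column."""
    values = tuple(row)
    return values[:-1], values[-1]


# Hypothetical row of (population, median sale price); the price is illustrative only.
example_row = (188226, 150.0)
features, label = split_features_label(example_row)
```

In the Spark version the features tuple is additionally wrapped in a dense vector, which is the type the MLlib estimators expect.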

https://spark.apache.org/docs/latest/ml-guide.html

In [0]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

In [0]:
a = dataFeat.rdd.map(lambda x: Vectors.dense(x))
a.collect()

In [0]:
def transData(data):
  # Combine columns to a dense vector (excluding the last column)
  dataFeaturesRDD = data.rdd.map(lambda r: [Vectors.dense(r[:-1]),r[-1]])
  
  # Convert the RDD back to a DataFrame, labelling the columns
  featuresDF =  dataFeaturesRDD.toDF(['features','label'])
  
  return featuresDF

In [0]:
dataLR = transData(dataFeat)
dataLR.show()

Finally we can run our linear regression. Here we create two models with different parameters and print the coefficients they provide.

In [0]:
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()

# Fit 2 models, using different regularization parameters
modelA = lr.fit(dataLR, {lr.regParam:0.0})
modelB = lr.fit(dataLR, {lr.regParam:100.0})

# Print the fitted model parameters
print(">>>> ModelA intercept: %r, coefficient: %r" % (modelA.intercept, modelA.coefficients[0]))
print(">>>> ModelB intercept: %r, coefficient: %r" % (modelB.intercept, modelB.coefficients[0]))
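To see why a larger regularization parameter changes the coefficient, the sketch below fits a one-feature ridge regression in plain Python using the closed-form solution. This is a simplified objective chosen for illustration (mean squared error plus an L2 penalty on the slope, intercept unpenalized); Spark's own objective differs in detail, for example it standardizes features by default, so the numbers will not match `modelA` or `modelB`. The function name `fit_ridge_1d` is hypothetical.

```python
def fit_ridge_1d(xs, ys, reg_param=0.0):
    """Minimise (1/2n) * sum((y - a - b*x)^2) + (reg_param/2) * b^2 for a and b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_mean) ** 2 for x in xs) / n
    slope = cov_xy / (var_x + reg_param)  # the penalty inflates the denominator
    intercept = y_mean - slope * x_mean
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

unregularized = fit_ridge_1d(xs, ys, reg_param=0.0)  # (1.0, 2.0)
regularized = fit_ridge_1d(xs, ys, reg_param=1.25)   # (3.5, 1.0)
```

With no penalty the fit recovers the true line; with the penalty the slope shrinks towards zero and the intercept moves towards the mean of the labels.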

Sometimes we will want to see the significance of the coefficients we have. We can check the p-values of our regressors:

In [0]:
# pValues lists the feature coefficients first; the intercept is the last element
pValuesA = modelA.summary.pValues
pValuesB = modelB.summary.pValues

print(">>>> ModelA p value intercept: %r, p value coefficient: %r" % (pValuesA[1], pValuesA[0]))
print(">>>> ModelB p value intercept: %r, p value coefficient: %r" % (pValuesB[1], pValuesB[0]))

Generating predictions from the model is easy. Here we use the input data and compare the model's predictions to the actual values.

In [0]:
predictionsA = modelA.transform(dataLR)
predictionsB = modelB.transform(dataLR)

In [0]:
display(predictionsA)

In [0]:
display(predictionsB)

The predictions don't look too great. In Databricks we can easily plot the residuals. You can drag the chart to make it bigger.

In [0]:
display(modelB, dataLR)
# The residuals should be randomly distributed around 0

Additionally, Spark includes functionality to help us evaluate our models, for example the regression evaluator.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE_A = evaluator.evaluate(predictionsA)
RMSE_B = evaluator.evaluate(predictionsB)

print("ModelA: Root Mean Squared Error = " + str(RMSE_A))
print("ModelB: Root Mean Squared Error = " + str(RMSE_B))

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are.
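As a sketch of what the evaluator computes, RMSE can be written in a few lines of plain Python: take the residuals, square them, average, and take the square root. The function name `rmse` is our own. Strictly speaking this matches the standard deviation of the residuals only when the residuals average to zero.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean squared residual."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0, every prediction is exact
off_by_some = rmse([0.0, 0.0], [3.0, 4.0])        # sqrt((9 + 16) / 2)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.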

According to RMSE, ModelA performs better (a lower RMSE means a closer fit).


### Practice exercise

In the two cells below we download a dataset of car performance (fuel efficiency in miles per gallon, MPG) and a number of car characteristics. Use the linear regression estimator to try to predict MPG based on:

 - Just <b> weight </b> as a feature

 - <b> displacement, weight, acceleration, and year </b> as features

Compare the RMSE of the two approaches and plot the residuals.

Hint: the `transData` function we defined is very useful; however, it requires the final column to be the data we are trying to predict. You will need to either change this function or rearrange the DataFrame, as MPG is currently the first column.
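One way to rearrange the columns is to build the new column order as a plain list and pass it to `select`. The sketch below shows only the list manipulation; the helper name `label_last` and the column names in the example are assumptions, so check `dataMPG.columns` for the real names.

```python
def label_last(columns, label):
    """Return the column names with `label` moved to the end."""
    if label not in columns:
        raise ValueError(f"{label!r} is not one of the columns")
    return [c for c in columns if c != label] + [label]


# Column names here are assumed for illustration only.
example_columns = ["mpg", "weight", "acceleration"]
ordered = label_last(example_columns, "mpg")  # ['weight', 'acceleration', 'mpg']
```

The resulting list can then be handed to the DataFrame's `select` method before calling `transData`.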

In [0]:
%sh wget https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv

In [0]:
dataMPG = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file:/databricks/driver/auto-mpg.csv")
dataMPG.cache()
print(dataMPG.count())
dataMPG.printSchema()

In [0]:
display(dataMPG.select([dataMPG.columns[2], dataMPG.columns[4], dataMPG.columns[5],  dataMPG.columns[6], dataMPG.columns[0]]))

In [0]:
#create the vector data1


In [0]:
#create vector data2


In [0]:
#fit models and create predictions


In [0]:
#Print intercept and coefficient


In [0]:
#Print RMSE's

### Gradient Boosted Tree Regression

Let's try another model from the Spark toolkit on our data: Gradient Boosted Tree Regression https://statisticasoftware.wordpress.com/2012/09/11/boosting-trees-for-regression-and-classification/

Follow the steps below to see how easily we can apply a different machine learning approach to our data.

In [0]:
# Import GBTRegressor class
from pyspark.ml.regression import GBTRegressor

# Define the gradient boosted tree regression algorithm
rf = GBTRegressor(maxDepth=2, seed=42) #1st try: maxDepth=2, seed=42 | 2nd try: maxDepth=2, seed=42 
dataMPGOrdered = transData(dataMPG.select(dataMPG.columns[1:] + [dataMPG.columns[0]]  ))
rf.setMaxIter(10) #1st try: 10 | 2nd try: 30
rf.setMinWeightFractionPerNode(0.1)  #1st try: 0.1 | 2nd try: 0.005
modelG = rf.fit(dataMPGOrdered)
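For intuition about what boosting does, here is a minimal plain-Python sketch of gradient boosting for squared error on a single feature, using depth-1 trees (stumps). It is not Spark's implementation: the function names are hypothetical, `n_rounds` and `step` only loosely correspond to the `maxIter` and `stepSize` parameters, and the toy data is invented for the example.

```python
def fit_stump(xs, residuals):
    """Pick the threshold whose left/right means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all xs identical: fall back to a constant stump
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=10, step=0.1):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + step * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return {"base": base, "step": step, "stumps": stumps}


def predict(model, x):
    """Sum the base value and every stump's scaled contribution."""
    total = model["base"]
    for t, left_mean, right_mean in model["stumps"]:
        total += model["step"] * (left_mean if x <= t else right_mean)
    return total


# Invented toy data with a single jump between x = 3 and x = 4.
toy_xs = [1, 2, 3, 4, 5, 6]
toy_ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted_stumps(toy_xs, toy_ys, n_rounds=5, step=0.5)
```

Each round corrects part of the remaining error, which is why adding more rounds keeps lowering the training error and can eventually overfit.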

In [0]:
predictions = modelG.transform(dataMPGOrdered)
predictions.select("features","label", "prediction").show(5)

In [0]:
# Select (prediction, true label) and compute the error on the training data
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on training data = %g" % rmse)

#1st 3.12151  | 2nd 0.847873

How did this approach perform in comparison to the previous regressions? Remember we are only looking at RMSE on the training data.

This model is prone to overfitting. Below we split the data into a training and a test dataset. This would be standard good practice if this is all of the data we have.

Fit the model to the training dataset and test the fit against the test dataset.

In [0]:
(trainingData, testData) = dataMPGOrdered.randomSplit([0.6, 0.4])

trainingData.show(5)
testData.show(5)
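The split is random, so the two sets are only approximately 60% and 40% of the rows, and they change from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below mimics the idea by assigning each row independently; the function name `random_split` is our own and this is not Spark's implementation.

```python
import random


def random_split(rows, train_fraction=0.6, seed=42):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


train_rows, test_rows = random_split(list(range(100)))
```

Every row lands in exactly one of the two sets, which is what keeps the test error an honest estimate of performance on unseen data.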

In [0]:
modelGTr = rf.fit(trainingData)
predictionsGTr = modelGTr.transform(testData)
rmseGtr = evaluator.evaluate(predictionsGTr)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmseGtr)

## Bonus exercise: Machine learning practice

I've included below a dataset of wine characteristics and quality. You can use this to practice applying the approaches above to build a predictor of wine quality.

Try to find the best model you can to predict wine quality.

If you want, you can try other machine learning models, see: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#regression

##### Below I fit two basic models to the data, based on the methods we have seen in this lab, and compare them. However, there is a lot more you could do.

In [0]:
%sh wget https://raw.githubusercontent.com/MingChen0919/learning-apache-spark/master/data/WineData.csv

In [0]:
wineDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file:/databricks/driver/WineData.csv")
wineDF.cache()
wineDF.printSchema()
wineDF.select("quality").distinct().take(100)

In [0]:
# Convert to float format
def string_to_float(x):
    return float(x)

# Bucket a numeric quality score into "low", "medium" or "high"
def condition(r):
    if (0<= r <= 4):
        label = "low"
    elif(4< r <= 6):
        label = "medium"
    else:
        label = "high"
    return label
  
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType
string_to_float_udf = udf(string_to_float, DoubleType())
quality_udf = udf(lambda x: condition(x), StringType())
wineDFQ = wineDF.withColumn("quality", quality_udf("quality"))

In [0]:
# Suggested steps:
#   1. Look at the dataset
#   2. Create the feature vector
#   3. Split into train and test sets
#   4. Predict
#   5. Look at the RMSE