
# Setup

In [1]:
# Installing java and downloading spark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

### Upload energy_data_ml.csv

# Spark ML

We will fit a regression model and a k-means clustering model with Spark ML, using household energy efficiency data. The dataset describes building shapes that differ in glazing area, glazing area distribution, and orientation, amongst other parameters. It comprises 768 samples and is intended for predicting the heating and cooling load of these buildings. The data come from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Energy+efficiency).

### Start Spark app

In [2]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Load and clean data

In [3]:
data = spark.read.csv("energy_data_ml.csv",header=True)

In [5]:
# drop the unwanted column
data = data.drop('_c0')
# drop rows with missing values
data = data.dropna()

In [6]:
data.printSchema()

root
 |-- Relative_Compactness: string (nullable = true)
 |-- Surface_Area: string (nullable = true)
 |-- Wall_Area: string (nullable = true)
 |-- Roof_Area: string (nullable = true)
 |-- Overall_Height: string (nullable = true)
 |-- Orientation: string (nullable = true)
 |-- Glazing_Area: string (nullable = true)
 |-- Cooling_Load: string (nullable = true)



In [7]:
# convert to numeric types
# import the double and integer types from Spark SQL
from pyspark.sql.types import DoubleType, IntegerType

#convert all columns
for col_name in data.columns:
    data = data.withColumn(col_name, data[col_name].cast(DoubleType()))

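# Orientation is categorical, so store it as an integer
data = data.withColumn("Orientation", data["Orientation"].cast(IntegerType()))

# Split the data into training and test sets (30% held out for testing);
# the seed makes the split reproducible
(trainingData, testData) = data.randomSplit([0.7, 0.3], seed=42)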
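
`randomSplit` assigns rows at random according to the weights, so the 70/30 proportions are approximate rather than exact. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and the function name `random_split` and the seed value are assumptions made for illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary
            splits[-1].append(row)
    return splits

# 768 row ids stand in for the 768 samples in the dataset
train, test = random_split(list(range(768)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

The two counts always sum to 768, but the training share is only close to 70%, which is why `trainingData.count()` in the notebook will rarely be exactly 0.7 times the row count.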

In [8]:
data.printSchema()

root
 |-- Relative_Compactness: double (nullable = true)
 |-- Surface_Area: double (nullable = true)
 |-- Wall_Area: double (nullable = true)
 |-- Roof_Area: double (nullable = true)
 |-- Overall_Height: double (nullable = true)
 |-- Orientation: integer (nullable = true)
 |-- Glazing_Area: double (nullable = true)
 |-- Cooling_Load: double (nullable = true)



### Prepare data for model

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
# one-hot encode the categorical Orientation column
labelEncoder = OneHotEncoder(inputCol="Orientation", outputCol="OrientationInd")
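
To make the encoding step concrete, here is a minimal plain-Python sketch of one-hot encoding a category index into a 0/1 vector. It mirrors the `dropLast` behaviour of Spark's `OneHotEncoder` (on by default), where the last category is represented by an all-zero vector. The function `one_hot` and its parameters are hypothetical names for illustration, and Spark itself returns a sparse vector rather than a Python list.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    # With drop_last, the final category maps to the all-zero vector
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Four hypothetical categories, indexed 0..3
for idx in range(4):
    print(idx, one_hot(idx, 4))
```

Dropping the last category avoids a redundant column, since its value is implied once all the other entries are zero.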

In [None]:
# assemble variables to one feature column
assembler = VectorAssembler(
    inputCols = ['Relative_Compactness',"Surface_Area","Wall_Area","Roof_Area","Overall_Height","OrientationInd","Glazing_Area"],
    outputCol = "features")

#define the estimator - decision tree
dt = DecisionTreeRegressor(labelCol="Cooling_Load", featuresCol="features")

# Chain encoder, assembler and tree in a Pipeline
pipeline = Pipeline(stages=[labelEncoder, assembler, dt])

### Fit pipeline and transform data

In [None]:
#fit the pipeline
PipelineModel = pipeline.fit(trainingData)

# transform using the pipeline
predictions = PipelineModel.transform(testData)

# preview predictions next to the true label, then evaluate model fit
predictions.select("prediction", "Cooling_Load").show(5)
evaluator = RegressionEvaluator(
    labelCol="Cooling_Load", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

In [None]:
predictions.show()

In [None]:
# root mean squared error on the test set
print(rmse)
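
The `rmse` metric is the square root of the mean squared difference between label and prediction, so it is expressed in the same units as `Cooling_Load`. The sketch below computes it by hand in plain Python; the three label and prediction values are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (errors of -2, 3 and 0)
example_labels = [20.0, 30.0, 25.0]
example_predictions = [22.0, 27.0, 25.0]
example_rmse = rmse(example_labels, example_predictions)
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.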

In [None]:
# save the fitted pipeline for later use (overwrite so the cell can be re-run)
PipelineModel.write().overwrite().save("my_pipeline")

### K-means clustering

In [None]:
from pyspark.ml.clustering import KMeans

# Trains a k-means model with 4 clusters.
kmeans = KMeans(featuresCol='features', predictionCol='prediction',k=4)

# build a pipeline that encodes, assembles and clusters
pipeline = Pipeline(stages=[labelEncoder, assembler, kmeans])

# fit pipeline
PipelineModel = pipeline.fit(data)

# transform using the pipeline
predictions = PipelineModel.transform(data)

In [None]:
#view result
predictions.show()
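
The `prediction` column here holds a cluster index between 0 and k-1: each row is assigned to the cluster whose centre is closest to its feature vector. The plain-Python sketch below shows that assignment step with squared Euclidean distance. The function `nearest_centroid` and the two-dimensional centroids are hypothetical and chosen only for illustration; they are not the centres fitted by the notebook.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

# Two made-up cluster centres in a 2-D feature space
example_centroids = [(0.0, 0.0), (10.0, 10.0)]
print(nearest_centroid((1.0, 2.0), example_centroids))
print(nearest_centroid((9.0, 8.0), example_centroids))
```

Training alternates this assignment step with recomputing each centre as the mean of its assigned points until the centres stop moving.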

### END

In [None]:
spark.stop()