<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


### Hands-on Lab: Saving and Loading a SparkML Model


#### Objectives:
In this lab you will:
 - Create a simple linear regression model
 - Save the SparkML model
 - Load the SparkML model
 - Make predictions using the loaded SparkML model


#### Install pyspark


In [1]:
!pip install pyspark
!pip install findspark


#### Import libraries


In [2]:
import findspark
findspark.init()

In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

#### Creating the spark session and context


In [4]:
# Create a Spark context
sc = SparkContext()

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Saving and Loading a SparkML Model").getOrCreate()

25/04/30 07:55:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


#### Importing Spark ML libraries


In [5]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#### Create a DataFrame with sample data


In [6]:
# Create a simple data set from an infant height (cm) vs. weight (kg) chart.

mydata = [[46,2.5],[51,3.4],[54,4.4],[57,5.1],[60,5.6],[61,6.1],[63,6.4]]

# Column names for the DataFrame
columns = ["height", "weight"]

# Create the DataFrame
mydf = spark.createDataFrame(mydata, columns)

# Show the DataFrame
mydf.show()

                                                                                

+------+------+
|height|weight|
+------+------+
|    46|   2.5|
|    51|   3.4|
|    54|   4.4|
|    57|   5.1|
|    60|   5.6|
|    61|   6.1|
|    63|   6.4|
+------+------+



#### Converting data frame columns into feature vectors

In this task we use the `VectorAssembler` transformer to combine DataFrame columns into a single feature-vector column.
In this example, the infant's height ("height") is the only input feature and the weight ("weight") is the target label.


In [7]:
assembler = VectorAssembler(
    inputCols=["height"],
    outputCol="features")

data = assembler.transform(mydf).select('features','weight')

In [8]:
data.show()

+--------+------+
|features|weight|
+--------+------+
|  [46.0]|   2.5|
|  [51.0]|   3.4|
|  [54.0]|   4.4|
|  [57.0]|   5.1|
|  [60.0]|   5.6|
|  [61.0]|   6.1|
|  [63.0]|   6.4|
+--------+------+
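
Conceptually, the assembler packs the chosen input columns of each row into one vector and leaves the label column alone. The following is only a plain-Python analogy of that step, not how Spark implements it; the `assemble` helper and the dictionary rows are hypothetical names used for illustration.

```python
# Plain-Python analogy of the assembling step (not the Spark implementation).
def assemble(rows, input_cols, label_col):
    """Pack the input columns of each row into one feature list and keep the label."""
    return [([float(row[c]) for c in input_cols], row[label_col]) for row in rows]

# Two rows taken from the sample data above.
rows = [{"height": 46, "weight": 2.5}, {"height": 51, "weight": 3.4}]
assembled = assemble(rows, ["height"], "weight")
print(assembled)
```

Each resulting pair corresponds to one row of the `features`/`weight` table shown above.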



#### Create and Train model

We create the model with the `LinearRegression` class and train it by calling its `fit()` method on the assembled data.


In [9]:
# Create a LR model
lr = LinearRegression(featuresCol='features', labelCol='weight', maxIter=100)
lr.setRegParam(0.1)
# Fit the model
lrModel = lr.fit(data)

25/04/30 07:56:06 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
25/04/30 07:56:06 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
25/04/30 07:56:07 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
25/04/30 07:56:07 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
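
As a rough sanity check on the fit, the same seven points can be fitted with ordinary least squares in plain Python. This is a sketch, not what Spark computes: the model above sets `regParam` to 0.1, so its slope and intercept should land near these unregularized values without matching them exactly. The `ols_fit` helper is a hypothetical name.

```python
# Sample data from the notebook: heights in cm, weights in kg.
heights = [46, 51, 54, 57, 60, 61, 63]
weights = [2.5, 3.4, 4.4, 5.1, 5.6, 6.1, 6.4]

def ols_fit(xs, ys):
    """Closed-form simple linear regression (no regularization)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = ols_fit(heights, weights)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"OLS prediction at 70 cm: {slope * 70 + intercept:.2f} kg")
```

If the Spark model's predictions were far from these numbers, that would suggest a mistake in the feature or label columns rather than an effect of regularization.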
                                                                                

#### Save the model


In [10]:
lrModel.save('infantheight2.model')

                                                                                

#### Load the model


In [11]:
# You need LinearRegressionModel to load the model
from pyspark.ml.regression import LinearRegressionModel

In [12]:
model = LinearRegressionModel.load('infantheight2.model')
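
Spark ML saves a model as a directory rather than a single file, and loading starts by reading a `metadata` subfolder inside it. A small check like the one below can tell you whether a path looks like a saved model before you call `load()`. This is a sketch: `looks_like_saved_model` and `demo.model` are hypothetical names, and the demo builds a fake directory in a temporary folder instead of touching the real model.

```python
import os
import tempfile

def looks_like_saved_model(path):
    """Return True if `path` is a directory containing a `metadata` subfolder."""
    return os.path.isdir(os.path.join(path, "metadata"))

# Demonstrate on a throwaway directory so no real model is needed.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = os.path.join(tmp, "demo.model")
    before = looks_like_saved_model(fake_model)        # nothing saved yet
    os.makedirs(os.path.join(fake_model, "metadata"))  # mimic the saved layout
    after = looks_like_saved_model(fake_model)

print(before, after)
```

In the notebook, calling `looks_like_saved_model('infantheight2.model')` before `LinearRegressionModel.load(...)` gives a clearer signal than a Java stack trace when the path is wrong.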

#### Make Prediction


#### Predict the weight of an infant whose height is 70 cm.


In [13]:
# Wrap a single height value in a one-row DataFrame and show the model's predicted weight.
def predict(height):
    assembler = VectorAssembler(inputCols=["height"], outputCol="features")
    data = [[height, 0]]  # weight is a placeholder; the model does not use it when predicting
    columns = ["height", "weight"]
    df = spark.createDataFrame(data, columns)
    transformed_df = assembler.transform(df).select('features', 'weight')
    predictions = model.transform(transformed_df)
    predictions.select('prediction').show()


In [14]:
predict(70)

+-----------------+
|       prediction|
+-----------------+
|7.863454719775907|
+-----------------+



### Practice exercises


#### Save the model as `babyweightprediction.model`


In [16]:
# write().overwrite() lets this cell be re-run even if the path already exists
lrModel.write().overwrite().save('babyweightprediction.model')

Double-click __here__ for the solution.

<!-- Hint:

lrModel.save('babyweightprediction.model')
-->


#### Load the model `babyweightprediction.model`


In [32]:
model = LinearRegressionModel.load('babyweightprediction.model')


Double-click __here__ for the solution.

<!-- Hint:

model = LinearRegressionModel.load('babyweightprediction.model')
-->


#### Predict the weight of an infant whose height is 50 cm.


Double-click __here__ for the solution.

<!-- Hint:

predict(50)
-->


In [27]:
predict(50)


+------------------+
|        prediction|
+------------------+
|3.4666826711164465|
+------------------+
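
Because the model is linear in height, the two predictions printed in this notebook (at 70 cm and 50 cm) are enough to recover its slope and intercept, assuming both came from the same fitted model. The sketch below does that arithmetic in plain Python; `predict_weight` is a hypothetical helper, and the recovered values should agree with `lrModel.coefficients` and `lrModel.intercept` if you print them.

```python
# Predictions printed by the model in this notebook.
pred_at_70 = 7.863454719775907
pred_at_50 = 3.4666826711164465

# Two points on a line determine its slope and intercept.
slope = (pred_at_70 - pred_at_50) / (70 - 50)
intercept = pred_at_50 - slope * 50

def predict_weight(height_cm):
    """Evaluate the recovered line at a given height."""
    return slope * height_cm + intercept

print(f"slope={slope:.4f} kg/cm, intercept={intercept:.4f} kg")
print(f"weight at 60 cm: {predict_weight(60):.2f} kg")
```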

