# 07 - 05

## 05 - Scala Machine Learning

In this exercise, we revisit Spark ML, this time with Scala, which can be up to 10 times faster for some workloads!

First, install the following packages to make a Scala kernel available:

```bash
pip install spylon-kernel
python -m spylon_kernel install
```


Then, in this notebook, open the "Kernel" menu, click "Change Kernel", and select "spylon-kernel". You are now ready to program in Scala.

# Scala basics

## Basics

*Print values:*

In [2]:
println(1)
println(1 + 1)
println("Hello!")
println("Hello," + " world!")

1
2
Hello!
Hello, world!


*Declare a variable:*

In [3]:
val x = 1 + 1
println(x)

2


x: Int = 2


*Types can also be declared explicitly:*

In [4]:
val x2: Int = 1 + 1

x2: Int = 2


In [5]:
println(x2.getClass)

int


*And you can define a variable as a combination of other variables:*

In [6]:
var x3: Int = x * x2

x3: Int = 4
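
The cells above declare `x` and `x2` with `val` but `x3` with `var`. As a minimal sketch of the difference (the names `a` and `b` are illustrative and not part of the exercise), a `val` is bound once while a `var` can be reassigned:

```scala
// `val` binds a name once; `var` can be reassigned later.
val a: Int = 1 + 1
var b: Int = a * 2

// Writing `a = 5` here would not compile ("reassignment to val").
b = b + 1

println(a) // 2
println(b) // 5
```

Preferring `val` unless reassignment is really needed is the usual Scala convention.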


## Functions

*Declare a function:*

In [7]:
val addOne = { (x: Int) => x + 1 }

addOne: Int => Int = <function1>


In [8]:
println(addOne(1))

2


*Declare a function with multiple parameters:*

In [9]:
val add = { (x: Int, y: Int) => x + y }

add: (Int, Int) => Int = <function2>


In [10]:
println(add(x, x2))

4
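
Because `addOne` and `add` are ordinary values, they can be handed to other functions. The sketch below redefines both so that it runs on its own, then passes them to the standard collection methods `map` and `foldLeft`; the lists are illustrative.

```scala
// Function values, written as in the cells above.
val addOne = (x: Int) => x + 1
val add = (x: Int, y: Int) => x + y

// Apply addOne to every element of a list.
val incremented = List(1, 2, 3).map(addOne)

// Combine all elements with the two-argument function, starting from 0.
val total = List(1, 2, 3).foldLeft(0)(add)

println(incremented) // List(2, 3, 4)
println(total)       // 6
```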


## Methods

*Methods are defined with `def`. They are like functions, but can take multiple parameter lists and default argument values:*

In [11]:
def addThenMultiply(x: Int, y: Int)(multiplier: Int)(minus: Int = 5):Int = {
    (x + y) * multiplier - minus
}

println(addThenMultiply(1,2)(3)())

4


addThenMultiply: (x: Int, y: Int)(multiplier: Int)(minus: Int)Int
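
One reason to use several parameter lists is that a method can be applied one list at a time. The sketch below repeats `addThenMultiply` so that it is self-contained; the name `sumOfOneAndTwo` is illustrative. The trailing `_` is Scala 2 syntax for turning the partially applied method into a function value. Default values are not carried over to that function value, so `minus` has to be passed explicitly.

```scala
def addThenMultiply(x: Int, y: Int)(multiplier: Int)(minus: Int = 5): Int = {
  (x + y) * multiplier - minus
}

// Supply only the first parameter list; the result is a function
// that still expects `multiplier` and then `minus`.
val sumOfOneAndTwo = addThenMultiply(1, 2) _

println(sumOfOneAndTwo(3)(5))        // (1 + 2) * 3 - 5 = 4
println(addThenMultiply(1, 2)(3)(1)) // default overridden: 9 - 1 = 8
```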


# Scala for Machine Learning

**Q1: Create your Spark Session**

In [33]:
val spark = 

2019-11-15 10:59:21 WARN  SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.


spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@33ce3c3b
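
One possible answer, as a sketch: the warning above ("Using an existing SparkSession") suggests the session was obtained with `getOrCreate()`. The application name and the `local[*]` master below are assumptions; adapt them to your environment.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode on all cores; the app name is illustrative.
val spark = SparkSession.builder()
  .appName("scala-ml-exercise")
  .master("local[*]")
  .getOrCreate() // reuses the kernel's session if one already exists

println(spark.version)
```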


**Q2: Read the CSV file `house.csv`**

In [15]:
val df = 

+---+--------------------+--------------------+--------------------+---------+--------+------------+------------+-------------+---------+-----------+---------------+----------+-----------+---------+---------+-----------+-------+-----------+
|_c0|             address|                info|           z_address|bathrooms|bedrooms|finishedsqft|lastsolddate|lastsoldprice| latitude|  longitude|   neighborhood|totalrooms|    usecode|yearbuilt|zestimate|zindexvalue|zipcode|       zpid|
+---+--------------------+--------------------+--------------------+---------+--------+------------+------------+-------------+---------+-----------+---------------+----------+-----------+---------+---------+-----------+-------+-----------+
|  2|Address: 1160 Mis...| San FranciscoSal...|1160 Mission St U...|      2.0|     2.0|      1043.0|  02/17/2016|    1300000.0|37.778705|-122.412635|South of Market|       4.0|Condominium|   2007.0|1167508.0|    975,700|94103.0|8.3152781E7|
|  5|Address: 260 King...| San Franc

df: org.apache.spark.sql.DataFrame = [_c0: int, address: string ... 17 more fields]


In [26]:
df.columns

res16: Array[String] = Array(_c0, address, info, z_address, bathrooms, bedrooms, finishedsqft, lastsolddate, lastsoldprice, latitude, longitude, neighborhood, totalrooms, usecode, yearbuilt, zestimate, zindexvalue, zipcode, zpid)
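
A possible way to read the file, sketched under assumptions: the `header` and `inferSchema` options are guesses based on the column names and numeric types shown above. So that the sketch runs anywhere, it first writes a tiny made-up stand-in file; in the exercise, pass `"house.csv"` to `.csv(...)` and assign the result to `df`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-csv-sketch")
  .master("local[*]")
  .getOrCreate()

// Made-up stand-in for house.csv (three of its columns, two rows).
val sampleText = Seq(
  "bedrooms,lastsoldprice,usecode",
  "2.0,1300000.0,Condominium",
  "3.0,1530000.0,SingleFamily"
).mkString("\n")
val samplePath = Files.createTempFile("house-sample", ".csv")
Files.write(samplePath, sampleText.getBytes("UTF-8"))

val sampleDf = spark.read
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark infer numeric column types
  .csv(samplePath.toString)

sampleDf.show()
println(sampleDf.columns.mkString(", "))
```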


**Q3: Run a basic SQL Data Exploration. What is the average price? What are the most frequent ZIP Codes?**

In [27]:
sqlDF.show()

+------------------+
|avg(lastsoldprice)|
+------------------+
|1263928.1871138571|
+------------------+



sqlDF: org.apache.spark.sql.DataFrame = [avg(lastsoldprice): double]


In [30]:
sqlDF2.show(5)

+-------+---+
|zipcode|num|
+-------+---+
|94110.0|935|
|94112.0|877|
|94107.0|857|
|94131.0|687|
|94116.0|655|
+-------+---+
only showing top 5 rows



sqlDF: org.apache.spark.sql.DataFrame = [zipcode: double, num: bigint]
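
The two results above could come from queries like the following sketch. It runs on a small made-up DataFrame, and the view name `houses` is an assumption; in the exercise, register `df` as the temporary view instead.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-exploration-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for the real data.
val toyHouses = Seq(
  (94110.0, 1000000.0),
  (94110.0, 2000000.0),
  (94112.0, 1500000.0)
).toDF("zipcode", "lastsoldprice")

// SQL queries need a named view to refer to.
toyHouses.createOrReplaceTempView("houses")

// Average sale price.
val sqlDF = spark.sql("SELECT AVG(lastsoldprice) FROM houses")
sqlDF.show()

// ZIP codes ordered by number of listings, most frequent first.
val sqlDF2 = spark.sql(
  """SELECT zipcode, COUNT(*) AS num
    |FROM houses
    |GROUP BY zipcode
    |ORDER BY num DESC""".stripMargin)
sqlDF2.show(5)
```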


**Q4: What is the schema of the data?**

root
 |-- _c0: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- info: string (nullable = true)
 |-- z_address: string (nullable = true)
 |-- bathrooms: double (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- finishedsqft: double (nullable = true)
 |-- lastsolddate: string (nullable = true)
 |-- lastsoldprice: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- totalrooms: double (nullable = true)
 |-- usecode: string (nullable = true)
 |-- yearbuilt: double (nullable = true)
 |-- zestimate: double (nullable = true)
 |-- zindexvalue: string (nullable = true)
 |-- zipcode: double (nullable = true)
 |-- zpid: double (nullable = true)
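
A schema listing like the one above is what `df.printSchema()` prints. Note that `zindexvalue` is inferred as a string, presumably because its values contain thousands separators (for example `975,700`). The sketch below shows one way such a column could be converted to a number; the second sample value and the column name `zindexvalue_num` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder()
  .appName("schema-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy column mimicking zindexvalue; "1,020,300" is a made-up value.
val toyIndex = Seq("975,700", "1,020,300").toDF("zindexvalue")
toyIndex.printSchema() // zindexvalue: string

// Strip the commas, then cast the remaining digits to double.
val cleaned = toyIndex.withColumn(
  "zindexvalue_num",
  regexp_replace(col("zindexvalue"), ",", "").cast("double"))

cleaned.printSchema()
cleaned.show()
```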



**Q5: Which column could be converted using a string indexer? Apply it.**

In [47]:
val indexer = 

indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_a1d478fb5d54
df2: org.apache.spark.sql.DataFrame = [_c0: int, address: string ... 18 more fields]
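
A minimal sketch of a `StringIndexer`, assuming the categorical column chosen is `usecode` (`neighborhood` would be another candidate; the source does not say which one was used). The toy rows and the output column name `usecode_index` are assumptions.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("string-indexer-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for the usecode column of df.
val toyUse = Seq("Condominium", "SingleFamily", "Condominium").toDF("usecode")

val indexer = new StringIndexer()
  .setInputCol("usecode")
  .setOutputCol("usecode_index") // hypothetical output column name

// fit() learns the label-to-index mapping, transform() appends the new column.
val indexed = indexer.fit(toyUse).transform(toyUse)
indexed.show()
```

By default the most frequent label receives index 0.0, so `Condominium` maps to 0.0 here and `SingleFamily` to 1.0.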


**Q6: Build your Vector Assembler.**

In [52]:
val assembler = 
val df3 = assembler.transform(df2)

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_11e7ea3ab1ba
df3: org.apache.spark.sql.DataFrame = [_c0: int, address: string ... 19 more fields]
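
As a sketch, a `VectorAssembler` packs several numeric columns into a single vector column. The choice of input columns below is an assumption, and the two rows are toy data; in the exercise, list every numeric feature you want the model to see and transform `df2`.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("vector-assembler-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy rows with three numeric columns.
val toyNumeric = Seq(
  (2.0, 2.0, 1043.0),
  (3.0, 3.0, 1300.0)
).toDF("bathrooms", "bedrooms", "finishedsqft")

val assembler = new VectorAssembler()
  .setInputCols(Array("bathrooms", "bedrooms", "finishedsqft"))
  .setOutputCol("features")

// Appends a "features" column holding one vector per row.
val assembled = assembler.transform(toyNumeric)
assembled.show(false)
```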


**Q7: Split your data into a training set and a test set.**

In [53]:
val 

train: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: int, address: string ... 19 more fields]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: int, address: string ... 19 more fields]
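
The `train` and `test` outputs above are consistent with a pattern-matching `val` on the result of `randomSplit`. A minimal sketch on a toy DataFrame follows; the 80/20 weights and the seed are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("random-split-sketch")
  .master("local[*]")
  .getOrCreate()

// Toy DataFrame with 100 rows; in the exercise, split df3 instead.
val toyRows = spark.range(100).toDF("id")

// Weights are normalised by Spark; the seed makes the split reproducible.
val Array(train, test) = toyRows.randomSplit(Array(0.8, 0.2), 42L)

println(s"train rows: ${train.count()}, test rows: ${test.count()}")
```

The split is random, so the two parts are only approximately 80% and 20% of the rows.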


**Q8: Train the regression algorithm of your choice.**

import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
gbt: org.apache.spark.ml.regression.GBTRegressor = gbtr_f7b35fd180c7


In [56]:
val model = 

model: org.apache.spark.ml.regression.GBTRegressionModel = GBTRegressionModel (uid=gbtr_f7b35fd180c7) with 10 trees
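
The output above shows a `GBTRegressor` producing a model with 10 trees, which matches `setMaxIter(10)`. The sketch below trains one on made-up data (price exactly proportional to surface); the column choices are assumptions. In the exercise, call `fit` on your `train` set.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gbt-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up training data: 50 rows, price = 1000 * finishedsqft.
val toyRaw = (1 to 50)
  .map(i => (i * 100.0, i * 100000.0))
  .toDF("finishedsqft", "lastsoldprice")

val toyTrain = new VectorAssembler()
  .setInputCols(Array("finishedsqft"))
  .setOutputCol("features")
  .transform(toyRaw)

val gbt = new GBTRegressor()
  .setLabelCol("lastsoldprice")
  .setFeaturesCol("features")
  .setMaxIter(10) // number of boosting iterations, one tree each

val model: GBTRegressionModel = gbt.fit(toyTrain)
println(s"Trained ${model.getNumTrees} trees")
```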


**Q9: Make predictions and compute the metric of your choice.**

In [59]:
val predictions = 

2019-11-15 11:20:02 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-11-15 11:20:02 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
+------------------+-------------+--------------------+
|        prediction|lastsoldprice|            features|
+------------------+-------------+--------------------+
|1613879.2210632167|    1530000.0|[3.0,3.0,1300.0,3...|
|1389284.2393296428|    1440000.0|[3.0,3.0,2378.0,3...|
| 1369447.861598761|    1700000.0|[2.5,2.0,2063.0,3...|
| 770113.3958960483|     700000.0|[1.0,2.0,1250.0,3...|
|1062512.6163005617|    1525000.0|[3.75,4.0,1846.0,...|
+------------------+-------------+--------------------+
only showing top 5 rows



In [61]:
val evaluator = 

Root Mean Squared Error (RMSE) on test data = 695538.9945380276


import org.apache.spark.ml.evaluation.RegressionEvaluator
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_273791166c7d
rmse: Double = 695538.9945380276
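
Predictions are typically obtained with `model.transform(test)`, which appends a `prediction` column. To show the evaluation step on its own, the sketch below builds a tiny made-up predictions table by hand and computes its RMSE with `RegressionEvaluator`; both errors are 10 in absolute value, so the RMSE should come out at about 10.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rmse-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up (prediction, label) pairs standing in for model.transform(test).
val toyPredictions = Seq(
  (110.0, 100.0),
  (190.0, 200.0)
).toDF("prediction", "lastsoldprice")

val evaluator = new RegressionEvaluator()
  .setLabelCol("lastsoldprice")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// RMSE = sqrt(mean((prediction - label)^2)), in the same unit as the label.
val rmse = evaluator.evaluate(toyPredictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
```

Because RMSE is expressed in the unit of the label, it is easiest to interpret next to the average price computed in Q3.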
