[back](./02-spark-key-functions.ipynb)

---
## `Spark ML Intro`


https://spark.apache.org/docs/latest/ml-features.html

### `Spark-ML Objective`

By the end of this notebook, you should be able to:

1.  Chain Spark DataFrame methods together to do **data munging**.
1.  Describe the **Spark-ML** API and recognize how it differs from **scikit-learn**.
1.  Chain **Spark-ML** Transformers and Estimators together to compose ML pipelines.

### `Chaining Transformations together!`

In [1]:
import pyspark as ps
import pyspark.sql.functions as F
from pyspark import SQLContext

spark = ps.sql.SparkSession.builder\
  .master('local[*]')\
  .appName('spark-ml')\
  .getOrCreate()

sc = spark.sparkContext

22/06/14 14:43:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
sqlContext = SQLContext(sc)

#### Find the date on which AAPL's closing stock price was the highest

- Input DataFrame

In [3]:
# Read the CSV file into a DataFrame
df_aapl = sqlContext.read.csv('../../assets/AAPL.csv',
                              header=True,        # first line holds the column names
                              quote='"',          # character used for quoting
                              sep=',',            # field separator
                              inferSchema=True)   # infer column types from the data

df_aapl.show(5)

+----------+----------+----------+----------+----------+----------+--------+
|      Date|      Open|      High|       Low|     Close| Adj Close|  Volume|
+----------+----------+----------+----------+----------+----------+--------+
|2018-05-09|186.550003|187.399994|185.220001|187.360001|186.640305|23211200|
|2018-05-10|187.740005|190.369995|187.649994|190.039993|189.309998|27989300|
|2018-05-11|189.490005|190.059998|187.449997|188.589996|188.589996|26212200|
|2018-05-14|189.009995|189.529999|187.860001|188.149994|188.149994|20778800|
|2018-05-15|186.779999|187.070007|185.100006|186.440002|186.440002|23695200|
+----------+----------+----------+----------+----------+----------+--------+
only showing top 5 rows



In [4]:
df_aapl.schema

StructType(List(StructField(Date,StringType,true),StructField(Open,DoubleType,true),StructField(High,DoubleType,true),StructField(Low,DoubleType,true),StructField(Close,DoubleType,true),StructField(Adj Close,DoubleType,true),StructField(Volume,IntegerType,true)))

Now we'll chain DataFrame methods into a small pipeline that will:

1.  Keep only the `Date` and `Close` columns
1.  Order the rows by `Close` in descending order

In [5]:
df_out = df_aapl.select('Date', 'Close').orderBy('Close', ascending=False)
df_out.show(5)

+----------+----------+
|      Date|     Close|
+----------+----------+
|2018-06-06|193.979996|
|2018-06-07|193.460007|
|2018-06-05|193.309998|
|2018-06-04|191.830002|
|2018-06-08|191.699997|
+----------+----------+
only showing top 5 rows



### `Supervised Machine Learning on DataFrames`

####  What is the difference between df_aapl and df_vector after running the code below?

In [6]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=['Close'], outputCol='Features')

df_vector = vectorAssembler.transform(df_aapl)
df_aapl.show(5)
df_vector.show(5)

+----------+----------+----------+----------+----------+----------+--------+
|      Date|      Open|      High|       Low|     Close| Adj Close|  Volume|
+----------+----------+----------+----------+----------+----------+--------+
|2018-05-09|186.550003|187.399994|185.220001|187.360001|186.640305|23211200|
|2018-05-10|187.740005|190.369995|187.649994|190.039993|189.309998|27989300|
|2018-05-11|189.490005|190.059998|187.449997|188.589996|188.589996|26212200|
|2018-05-14|189.009995|189.529999|187.860001|188.149994|188.149994|20778800|
|2018-05-15|186.779999|187.070007|185.100006|186.440002|186.440002|23695200|
+----------+----------+----------+----------+----------+----------+--------+
only showing top 5 rows

+----------+----------+----------+----------+----------+----------+--------+------------+
|      Date|      Open|      High|       Low|     Close| Adj Close|  Volume|    Features|
+----------+----------+----------+----------+----------+----------+--------+------------+
|2018-05-09|

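- `df_vector` has one extra column, `Features`, which holds each `Close` value as a one-element vector. `df_aapl` itself is unchanged.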

In [7]:
scaler = MinMaxScaler(inputCol='Features', outputCol='Scaled Features')

# Compute summary statistics and generate MinMaxScalerModel
scaler_model = scaler.fit(df_vector)

# Rescale each feature to the range [min, max] (defaults to [0, 1])
scaled_data = scaler_model.transform(df_vector)
scaled_data.select('Features', 'Scaled Features').show(5)

+------------+--------------------+
|    Features|     Scaled Features|
+------------+--------------------+
|[187.360001]|[0.13689742813492...|
|[190.039993]|[0.48630977478742...|
|[188.589996]|[0.2972618767306078]|
|[188.149994]|[0.23989523856459...|
|[186.440002]|[0.01694967847449...|
+------------+--------------------+
only showing top 5 rows



In [8]:
scaled_data.select('Features', 'Scaled Features').show(10)

+------------+--------------------+
|    Features|     Scaled Features|
+------------+--------------------+
|[187.360001]|[0.13689742813492...|
|[190.039993]|[0.48630977478742...|
|[188.589996]|[0.2972618767306078]|
|[188.149994]|[0.23989523856459...|
|[186.440002]|[0.01694967847449...|
|[188.179993]|[0.24380645210076...|
|[186.990005]|[0.08865804137106...|
|[186.309998]|               [0.0]|
|[187.630005]|[0.17210004487615...|
|[187.160004]|[0.11082219317397...|
+------------+--------------------+
only showing top 10 rows
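The scaled values can be checked by hand. The sketch below applies the usual min-max formula in plain Python, without Spark. It assumes the column minimum is `186.309998` (the row scaled to `[0.0]`) and the column maximum is `193.979996` (the highest `Close` in the sorted output earlier); the helper `min_max` is hypothetical and written only for this check.

```python
# Close values copied from the ten rows printed above
closes = [187.360001, 190.039993, 188.589996, 188.149994, 186.440002,
          188.179993, 186.990005, 186.309998, 187.630005, 187.160004]

# Assumption: column-wide extremes read off the outputs shown earlier
col_min = 186.309998   # the row whose scaled value is [0.0]
col_max = 193.979996   # highest Close in the sorted output

def min_max(x, lo, hi, new_min=0.0, new_max=1.0):
    """Rescale x from the range [lo, hi] to [new_min, new_max]."""
    return (x - lo) / (hi - lo) * (new_max - new_min) + new_min

scaled = [min_max(c, col_min, col_max) for c in closes]
print(scaled[:3])
```

The first few results should agree with the `Scaled Features` column above to within floating-point rounding.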



### `Transformers`

The `VectorAssembler` class above is an example of a generic type in Spark called a `Transformer`. Important things to know about this type:

- They implement a `transform` method.
- They convert one **DataFrame** into another, usually by adding columns.

Examples of `Transformers`: `VectorAssembler`, `Tokenizer`, `StopWordsRemover`, and many more.

### `Estimators`

According to the docs: "An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data". Important things to know about this type are:

- They implement a `fit` method whose argument is a `DataFrame`.
- The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of `Estimators`: `MinMaxScaler` (used above), `LogisticRegression`, `DecisionTreeRegressor`, and many more.

### `Conclusion`

In [9]:
sc.stop()


---
[next]()