# Spark Lab 7: Using Spark MLlib for Feature Engineering and Prediction


This lab builds an ML pipeline, with a focus on exploring feature data and engineering features. It has two parts:

- Part 1 `Concrete Quality`: computing column statistics and engineering numerical features
- Part 2 `Car Value`: engineering string columns


**Topics**: `describe, VectorAssembler, StandardScaler, Pipeline, StringIndexer, DecisionTreeClassifier, MulticlassClassificationEvaluator`




**Tip**: If at any point you see the error `AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;`, try the following:

- Remove the `*.lck` files from the Hive `metastore_db` directory:
```
# assuming you're running this from your home directory on the Cloudera VM
rm  metastore_db/*.lck
```
- Terminate all other running Jupyter notebooks (from your Jupyter home, go to the Running tab, then terminate them).

If the above still does not work, restart the kernel (from your current Jupyter notebook's menu, Kernel > Restart).

### Preparation

1\. Download and unzip the data files needed for this lab from 

[http://idsdl.csom.umn.edu/c/share/sparklab07data.zip](http://idsdl.csom.umn.edu/c/share/sparklab07data.zip)



In [4]:
!wget http://idsdl.csom.umn.edu/c/share/sparklab07data.zip


--2019-10-30 18:59:21--  http://idsdl.csom.umn.edu/c/share/sparklab07data.zip
Resolving idsdl.csom.umn.edu (idsdl.csom.umn.edu)... 134.84.138.46, 2607:ea00:101:480a:250:56ff:febb:e76b
Connecting to idsdl.csom.umn.edu (idsdl.csom.umn.edu)|134.84.138.46|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26506 (26K) [application/zip]
Saving to: ‘sparklab07data.zip.2’


2019-10-30 18:59:21 (22.9 MB/s) - ‘sparklab07data.zip.2’ saved [26506/26506]


In [6]:
!unzip -o sparklab07data.zip

Archive:  sparklab07data.zip
  inflating: cars_test.csv           
  inflating: cars_train.csv          
  inflating: concrete.csv            
  inflating: concrete_test.csv       
  inflating: concrete_train.csv      


## Part 1. Prepare Concrete Quality Dataset for Spark MLlib

This part of the lab uses a dataset on the properties and strength of concrete. Please complete the lab using the `spark.ml` API functions.

#|Field Name|Type| Description
--|--|--|--
0|cement|Double|Mass, in kg per cubic meter of mixture
1|blast_furnace_slag|Double|Mass, in kg per cubic meter of mixture
2|fly_ash|Double|Mass, in kg per cubic meter of mixture
3|water|Double|Mass, in kg per cubic meter of mixture
4|superplasticizer|Double|Mass, in kg per cubic meter of mixture
5|coarse_aggregate|Double|Mass, in kg per cubic meter of mixture
6|fine_aggregate|Double|Mass, in kg per cubic meter of mixture
7|age|Double|Age, in days
8|compressive_strength|Double|Strength, in megapascals (MPa)

1\. Sample the first few lines of `concrete_train.csv` using Linux command(s) to help you decide how to handle this file.

In [7]:
!head concrete_train.csv

540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.30
266.0,114.0,0.0,228.0,0.0,932.0,670.0,90,47.03
380.0,95.0,0.0,228.0,0.0,932.0,594.0,365,43.70
380.0,95.0,0.0,228.0,0.0,932.0,594.0,28,36.45
266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
475.0,0.0,0.0,228.0,0.0,932.0,594.0,28,39.29


2\. Load the `concrete_train.csv` file into a DataFrame and verify its content.

In [8]:
train = spark.read.option("inferSchema","true").csv("concrete_train.csv")

In [9]:
train.limit(2).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89


In [10]:
fields = ["cement","blast_furnace_slag","fly_ash","water",
         "superplasticizer","coarse_aggregate","fine_aggregate",
         "age","compressive_strength"]
train2 = train.toDF(*fields)

# the above is the same as
# train2 = train.toDF("cement","blast_furnace_slag","fly_ash","water",
#         "superplasticizer","coarse_aggregate","fine_aggregate",
#         "age","compressive_strength")
train2.limit(2).toPandas()


Unnamed: 0,cement,blast_furnace_slag,fly_ash,water,superplasticizer,coarse_aggregate,fine_aggregate,age,compressive_strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89


3\. Because this dataset has numerical columns, it is useful to compute summary statistics on it. Report the descriptive statistics.

**You may convert a DataFrame to pandas dataframe using `.toPandas()` for better readability.**
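As a rough, plain-Python illustration of the statistics that `describe()` reports for a numeric column (count, mean, stddev, min, max), the sketch below computes them for the ten `compressive_strength` values shown in the `head` output above. It is not Spark code: the helper name `describe_column` is hypothetical, and `stddev` is taken to be the sample standard deviation.

```python
from statistics import mean, stdev

# compressive_strength values from the first ten rows of concrete_train.csv
strength = [79.99, 61.89, 40.27, 41.05, 44.30, 47.03, 43.70, 36.45, 45.85, 39.29]


def describe_column(values):
    """Return the five summary statistics for one numeric column (hypothetical helper)."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stddev": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }


summary = describe_column(strength)
for name, value in summary.items():
    print(name, value)
```

In the lab itself, `train2.describe().toPandas()` should produce the same kind of table for every column at once.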

[`SQLTransformer(statement=...)`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.SQLTransformer) implements transformations that are defined by a SQL statement.

Currently it only supports SQL syntax like `"SELECT … FROM __THIS__"`, where `__THIS__` represents the underlying table of the input dataset.

4\. Define a SQLTransformer `st` that creates a new field `age_enc` with the following values:

- 1, if age is between 0 and 30 (inclusive)
- 2, if age is > 30 and <= 90
- 3, if age is > 90 and <= 180
- 4, if age is > 180

Click the above link to see the documentation/example of this API.
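One possible shape for the statement is a `CASE` expression over `age`. The sketch below is only an illustration under assumptions: it treats an age of exactly 180 as category 3, and it checks the SQL logic locally with Python's built-in `sqlite3` on a hypothetical table named `concrete` rather than with Spark.

```python
import sqlite3

# Statement in the form SQLTransformer expects; __THIS__ stands for the input table.
statement = """
SELECT *,
       CASE WHEN age <= 30 THEN 1
            WHEN age <= 90 THEN 2
            WHEN age <= 180 THEN 3
            ELSE 4
       END AS age_enc
FROM __THIS__
"""

# Local check only: substitute a hypothetical SQLite table for __THIS__.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concrete (age INTEGER)")
conn.executemany("INSERT INTO concrete VALUES (?)", [(28,), (90,), (180,), (365,)])
rows = conn.execute(statement.replace("__THIS__", "concrete")).fetchall()
conn.close()

# Map each age to the category it was assigned.
age_to_enc = dict(rows)
print(age_to_enc)
```

In the notebook, the same string would be passed as `SQLTransformer(statement=statement)`.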

5\. Use [`VectorAssembler`](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler) to create a new `features` column with all fields except `compressive_strength` and `age`. 

If needed, click the link above to see an example from the official pyspark documentation.
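Conceptually, `VectorAssembler` concatenates the values of the chosen input columns, in the order listed, into one vector per row. The plain-Python sketch below mimics that for a single row. It is not Spark code, the `assemble` helper is hypothetical, and it assumes the `age_enc` column from step 4 is among the inputs.

```python
# One row of the concrete data as a dict; age_enc is assumed to come from step 4.
row = {
    "cement": 540.0, "blast_furnace_slag": 0.0, "fly_ash": 0.0, "water": 162.0,
    "superplasticizer": 2.5, "coarse_aggregate": 1040.0, "fine_aggregate": 676.0,
    "age": 28, "age_enc": 1, "compressive_strength": 79.99,
}

# Every column except the label and the raw age.
excluded = {"compressive_strength", "age"}
input_cols = [name for name in row if name not in excluded]


def assemble(record, columns):
    """Concatenate the selected column values into one list of floats (hypothetical helper)."""
    return [float(record[name]) for name in columns]


features = assemble(row, input_cols)
print(input_cols)
print(features)
```

With Spark, the same column list would be passed as `VectorAssembler(inputCols=input_cols, outputCol="features")`.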

Verify your assembler does what it needs to do:

`StandardScaler` transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:

- `withStd`: True by default. Scales the data to unit standard deviation.
- `withMean`: False by default. Centers the data on the mean before scaling. This produces dense output, so take care when applying it to sparse input.
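To make the two parameters concrete, here is a small plain-Python sketch of the arithmetic for a single feature column. It is an illustration rather than Spark's implementation: the function name `standard_scale` is hypothetical, the sample standard deviation is assumed, and constant columns (zero standard deviation) are not handled.

```python
from statistics import mean, stdev


def standard_scale(column, with_std=True, with_mean=False):
    """Scale one feature column; assumes the column is not constant."""
    center = mean(column) if with_mean else 0.0
    scale = stdev(column) if with_std else 1.0
    return [(value - center) / scale for value in column]


# Hypothetical feature column with mean 4 and sample standard deviation 2.
values = [2.0, 4.0, 6.0]
scaled_default = standard_scale(values)                   # withStd only
scaled_centered = standard_scale(values, with_mean=True)  # withStd and withMean
print(scaled_default)
print(scaled_centered)
```

With the defaults the values are only divided by the standard deviation, so zeros stay zeros; that is why sparse input stays sparse unless `withMean` is turned on.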

6\. Create an instance of [`StandardScaler`](https://spark.apache.org/docs/latest/ml-features.html#standardscaler) called `ss`

- It should apply to `features` and create a new column `scaledFeatures`.


Verify your standard scaler

7\. Create an instance of [`Pipeline`](https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline) called `pl`

-  Its `stages` should include the SQLTransformer, VectorAssembler and StandardScaler.

8\. Use the `Pipeline` to transform the data and obtain a new DataFrame, `transformed`.

Inspect the `scaledFeatures` and `features` columns in the transformed dataset.

## Part 2. Using Decision Tree Classifiers to Predict Car Value

During this exercise, you will build a Spark ML Pipeline to encode categorical string variables as integers before using them to build a model. 

The data used for this exercise concerns various properties of cars and whether or not these cars were classified as a good value. The target variable to be predicted is `acceptability`, a categorical variable representing whether or not a car is considered acceptable for purchase. All of the feature variables are also **categorical**.

#|Field|Data Type |Description
--|--|--|--
0|buying|String|Based on selling price
1|maint|String|Based on cost to maintain the vehicle
2|doors|String|Number of doors
3|persons|String|Passenger capacity
4|lug_boot|String|Based on luggage boot size
5|safety|String|Based on estimated safety of the vehicle
6|acceptability|String|Based on overall acceptability of the vehicle

10\. Begin by importing the necessary modules for this exercise.

If you’re using Scala, you’ll use this code:
```
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer,VectorAssembler}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
```
If you’re using PySpark, you’ll use this code:
```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
```


1\. Sample the first few lines of `cars_train.csv` using Linux command(s) to help you decide how to handle this file.

2\. Load the `cars_train.csv` file into a DataFrame named `train_df`. Verify its schema and content.

4\. Explore the `buying`, `doors`, `persons`, and `acceptability` columns by showing their distinct values. What kinds of columns are these?

5\. Create a new [`StringIndexer`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) for each of the columns, with the output column name in the form of `[colname]_ix` (for example, `buying` becomes `buying_ix`). Save these seven StringIndexers as `si1`, `si2`, `si3`, and so on.

**Note**: the default sort order of `StringIndexer` (the `stringOrderType` parameter) is `frequencyDesc`; the other options are `frequencyAsc`, `alphabetDesc`, and `alphabetAsc`. See the above link and click source for more details.
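The `frequencyDesc` ordering can be pictured with a short plain-Python sketch: the most frequent label gets index 0, the next most frequent gets index 1, and so on. This is not Spark's implementation. The helper `fit_string_index` and the sample column are hypothetical, and breaking ties alphabetically is an assumption made here for determinism.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# Hypothetical sample of a categorical column.
safety = ["low", "high", "low", "med", "low", "med"]
mapping = fit_string_index(safety)
indexed = [mapping[label] for label in safety]
print(mapping)
print(indexed)
```

Because the indices depend on label frequencies in the data the indexer was fitted on, the same string can map to different numbers in different datasets unless the fitted model is reused.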

6\. Next, create a `VectorAssembler` called `va` to assemble each of the indexed columns **except `acceptability_ix`** into a new column called `features`.



7\. Create a `DecisionTreeClassifier` 

- the label column should be `acceptability_ix`
- the features column should be `features`. 

8\. Create a new Spark ML `Pipeline` called `pl` 

- the `stages` should include all of the `StringIndexer`s, the `VectorAssembler`, and the `DecisionTreeClassifier`.

9\. Create a PipelineModel named `plmodel` by fitting the pipeline on `train_df`.

10\. Create a new DataFrame called `test_df` from the `cars_test.csv` dataset. 

12\. Apply the learned model to `test_df` and save the resulting DataFrame as `predictions`.

- How many of the first 15 values in the `prediction` column match the values in the `acceptability_ix` column?  


13\. Use [`MulticlassClassificationEvaluator`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator) to evaluate the predictions on the `accuracy` metric.
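The `accuracy` metric is simply the fraction of rows whose predicted index equals the true label index. A minimal plain-Python sketch of that calculation follows, using made-up label and prediction values rather than real model output; the `accuracy` helper is hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Hypothetical indexed labels and predictions, not results from the lab data.
labels = [0.0, 1.0, 0.0, 2.0, 0.0]
predictions = [0.0, 1.0, 1.0, 2.0, 0.0]
score = accuracy(labels, predictions)
print(score)
```

In the lab, the evaluator would be configured along the lines of `MulticlassClassificationEvaluator(labelCol="acceptability_ix", predictionCol="prediction", metricName="accuracy")` and applied with `.evaluate(predictions)`.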