# Spark Lab 7: Using Spark MLlib for Feature Engineering and Prediction


This lab builds an ML pipeline, with an emphasis on data exploration and feature engineering. It has two parts:

- Part 1 `Concrete Quality`: we compute column statistics and engineer numerical features
- Part 2 `Car Value`: we engineer string columns.


**Topics**: `describe, VectorAssembler, StandardScaler, Pipeline, StringIndexer, DecisionTreeClassifier, MulticlassClassificationEvaluator`




**Tip**: If at any point you see the error `AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;`, try the following:

- Remove the `*.lck` files from the Hive `metastore_db` directory:
```
# assuming you're running this from your home directory on the Cloudera VM
rm  metastore_db/*.lck
```
- Terminate all other running Jupyter notebooks (from your Jupyter home, go to the Running tab, then terminate them).

If that still does not work, restart the kernel (from your current Jupyter notebook's menu, Kernel > Restart).

### Preparation

1\. Download and unzip the data files needed for this lab from 

[http://idsdl.csom.umn.edu/c/share/sparklab07data.zip](http://idsdl.csom.umn.edu/c/share/sparklab07data.zip)

## Part 1. Prepare Concrete Quality Dataset for Spark MLlib

This part of the lab uses a dataset describing various properties of concrete and its compressive strength. Please complete the lab using `spark.ml` API functions.

#|Field Name|Type| Description
--|--|--|--
0|cement|Double|Mass, in kg per cubic meter of mixture
1|blast_furnace_slag|Double|Mass, in kg per cubic meter of mixture
2|fly_ash|Double|Mass, in kg per cubic meter of mixture
3|water|Double|Mass, in kg per cubic meter of mixture
4|superplasticizer|Double|Mass, in kg per cubic meter of mixture
5|course_aggregate|Double|Mass, in kg per cubic meter of mixture
6|fine_aggregate|Double|Mass, in kg per cubic meter of mixture
7|age|Double|Age, in days
8|compressive_strength|Double|Strength, in megapascals (MPa)

1\. Sample the first few lines of `concrete_train.csv` using Linux command(s) to help you decide how to handle this file.

In [1]:
# Your path may be different from the one shown below.
! head concrete_train.csv

540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.30
266.0,114.0,0.0,228.0,0.0,932.0,670.0,90,47.03
380.0,95.0,0.0,228.0,0.0,932.0,594.0,365,43.70
380.0,95.0,0.0,228.0,0.0,932.0,594.0,28,36.45
266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
475.0,0.0,0.0,228.0,0.0,932.0,594.0,28,39.29


2\. Load the `concrete_train.csv` file into a DataFrame. Verify its content.

In [2]:
fields = ["cement","blast_furnace_slag","fly_ash","water", \
          "superplasticizer","course_aggregate","fine_aggregate", \
         "age","compressive_strength"]

In [4]:
# Your path may be different.
train = spark.read.option("inferSchema",True).csv("concrete_train.csv").toDF(*fields)

In [5]:
train.printSchema()

root
 |-- cement: double (nullable = true)
 |-- blast_furnace_slag: double (nullable = true)
 |-- fly_ash: double (nullable = true)
 |-- water: double (nullable = true)
 |-- superplasticizer: double (nullable = true)
 |-- course_aggregate: double (nullable = true)
 |-- fine_aggregate: double (nullable = true)
 |-- age: integer (nullable = true)
 |-- compressive_strength: double (nullable = true)



In [6]:
train.show()

+------+------------------+-------+-----+----------------+----------------+--------------+---+--------------------+
|cement|blast_furnace_slag|fly_ash|water|superplasticizer|course_aggregate|fine_aggregate|age|compressive_strength|
+------+------------------+-------+-----+----------------+----------------+--------------+---+--------------------+
| 540.0|               0.0|    0.0|162.0|             2.5|          1040.0|         676.0| 28|               79.99|
| 540.0|               0.0|    0.0|162.0|             2.5|          1055.0|         676.0| 28|               61.89|
| 332.5|             142.5|    0.0|228.0|             0.0|           932.0|         594.0|270|               40.27|
| 332.5|             142.5|    0.0|228.0|             0.0|           932.0|         594.0|365|               41.05|
| 198.6|             132.4|    0.0|192.0|             0.0|           978.4|         825.5|360|                44.3|
| 266.0|             114.0|    0.0|228.0|             0.0|           932

3\. Because this dataset consists of numerical columns, it is useful to compute summary statistics on it. Report the descriptive statistics.

**You may convert a Spark DataFrame to a pandas DataFrame using `.toPandas()` for better readability.**

In [7]:
train.describe().toPandas()

Unnamed: 0,summary,cement,blast_furnace_slag,fly_ash,water,superplasticizer,course_aggregate,fine_aggregate,age,compressive_strength
0,count,900.0,900.0,900.0,900.0,900.0,900.0,900.0,900.0,900.0
1,mean,290.563222222222,68.4981111111111,49.03222222222221,180.72266666666667,5.810888888888885,981.0683333333328,776.2333333333324,48.21333333333333,36.37484444444442
2,stddev,104.64460120499344,85.81753820138678,61.89869603838202,21.761715870809034,6.173544447428106,75.5029483066943,81.38528207195104,67.20007349431862,17.210366992350007
3,min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
4,max,540.0,359.4,200.0,247.0,32.2,1145.0,992.6,365.0,82.6


[`SQLTransformer(statement=...)`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.SQLTransformer) implements transformations that are defined by a SQL statement.

Currently, it only supports SQL syntax like `"SELECT … FROM __THIS__"`, where `__THIS__` represents the underlying table of the input dataset.

4\. Define a SQLTransformer `st` that creates a new field `age_enc` with the following values:

- 1, if age is between 0 and 30 (inclusive)
- 2, if age is > 30 and <= 90
- 3, if age is > 90 and <= 180
- 4, if age is > 180

See the documentation linked above for an example of this API.

In [9]:
from pyspark.ml.feature import SQLTransformer

st = SQLTransformer(statement="""
select *, case when age between 0 and 30 then 1 
when (age > 30 and age <=90) then 2 
when (age>90 and age<=180) then 3 
else 4 end as age_enc
from __THIS__
""")

In [10]:
train_st = st.transform(train)

In [12]:
train_st.limit(10).toPandas()

Unnamed: 0,cement,blast_furnace_slag,fly_ash,water,superplasticizer,course_aggregate,fine_aggregate,age,compressive_strength,age_enc
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99,1
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89,1
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27,4
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05,4
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3,4
5,266.0,114.0,0.0,228.0,0.0,932.0,670.0,90,47.03,2
6,380.0,95.0,0.0,228.0,0.0,932.0,594.0,365,43.7,4
7,380.0,95.0,0.0,228.0,0.0,932.0,594.0,28,36.45,1
8,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85,1
9,475.0,0.0,0.0,228.0,0.0,932.0,594.0,28,39.29,1
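
As a sanity check on the bin boundaries, the `CASE` expression can be mirrored in plain Python. This is only an illustrative sketch: the helper name `age_enc` is hypothetical and not part of Spark, and it assumes ages are non-negative, as they are in this dataset.

```python
def age_enc(age):
    """Mirror the SQL CASE expression used in the SQLTransformer above."""
    if 0 <= age <= 30:       # SQL BETWEEN is inclusive on both ends
        return 1
    elif 30 < age <= 90:
        return 2
    elif 90 < age <= 180:
        return 3
    else:                    # ELSE branch: age > 180 (negative ages would also land here)
        return 4

# Ages taken from the first rows of concrete_train.csv shown above
sample_ages = [28, 270, 365, 360, 90]
encoded = [age_enc(a) for a in sample_ages]
print(encoded)  # [1, 4, 4, 4, 2]
```

The encoded values agree with the `age_enc` column in the output above.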


5\. Use [`VectorAssembler`](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler) to create a new `features` column from all fields except `compressive_strength` and `age` (the encoded `age_enc` column is used in place of `age`).

If needed, click the link above to see an example from the official PySpark documentation.

In [13]:
from pyspark.ml.feature import VectorAssembler

In [14]:
featureCols = ['cement',
 'blast_furnace_slag',
 'fly_ash',
 'water',
 'superplasticizer',
 'course_aggregate',
 'fine_aggregate',
 'age_enc']

In [15]:
assembler = VectorAssembler(inputCols=featureCols,outputCol="features")

Verify your assembler does what it needs to do:

In [16]:
train_va = assembler.transform(train_st)

In [17]:
train_va.limit(3).toPandas()

Unnamed: 0,cement,blast_furnace_slag,fly_ash,water,superplasticizer,course_aggregate,fine_aggregate,age,compressive_strength,age_enc,features
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99,1,"[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 1.0]"
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89,1,"[540.0, 0.0, 0.0, 162.0, 2.5, 1055.0, 676.0, 1.0]"
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27,4,"[332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, ..."
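
Conceptually, the assembler concatenates the listed columns, in order, into one numeric vector per row. A minimal plain-Python sketch of that idea follows. It is not Spark code; the `assemble` helper and the dictionary row are illustrative assumptions.

```python
feature_cols = ["cement", "blast_furnace_slag", "fly_ash", "water",
                "superplasticizer", "course_aggregate", "fine_aggregate",
                "age_enc"]

def assemble(row, cols):
    """Return the values of `cols` from `row`, in order, as floats."""
    return [float(row[c]) for c in cols]

# First row of train_st, copied from the output above
row0 = {"cement": 540.0, "blast_furnace_slag": 0.0, "fly_ash": 0.0,
        "water": 162.0, "superplasticizer": 2.5, "course_aggregate": 1040.0,
        "fine_aggregate": 676.0, "age": 28, "compressive_strength": 79.99,
        "age_enc": 1}

# `age` and `compressive_strength` are skipped because they are not in feature_cols
features0 = assemble(row0, feature_cols)
print(features0)  # [540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 1.0]
```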


`StandardScaler` transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:

- `withStd`: True by default. Scales the data to unit standard deviation.
- `withMean`: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.

6\. Create an instance of [`StandardScaler`](https://spark.apache.org/docs/latest/ml-features.html#standardscaler) called `ss`

- it should apply to `features` and create a new column `scaledfeatures`


In [18]:
from pyspark.ml.feature import StandardScaler

In [19]:
ss = StandardScaler(inputCol="features", outputCol="scaledfeatures")

Verify your standard scaler

In [20]:
train_ss = ss.fit(train_va).transform(train_va)

In [21]:
train_ss.limit(3).toPandas()

Unnamed: 0,cement,blast_furnace_slag,fly_ash,water,superplasticizer,course_aggregate,fine_aggregate,age,compressive_strength,age_enc,features,scaledfeatures
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99,1,"[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 1.0]","[5.16032355021, 0.0, 0.0, 7.44426592837, 0.404..."
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89,1,"[540.0, 0.0, 0.0, 162.0, 2.5, 1055.0, 676.0, 1.0]","[5.16032355021, 0.0, 0.0, 7.44426592837, 0.404..."
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27,4,"[332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, ...","[3.17742144527, 1.66049974151, 0.0, 10.4771150..."
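
With the default settings (`withStd=True`, `withMean=False`), each feature is divided by its sample standard deviation and is not centered. The sketch below checks this by hand for the first row, using the `stddev` values that `describe()` reported earlier. It is plain Python rather than Spark code, and the variable names are illustrative.

```python
# stddev values copied from the describe() output above
stddev = {
    "cement": 104.64460120499344,
    "water": 21.761715870809034,
    "superplasticizer": 6.173544447428106,
}

# Raw values of the first training row
row0 = {"cement": 540.0, "water": 162.0, "superplasticizer": 2.5}

# withStd=True, withMean=False: divide by the standard deviation, no centering
scaled = {name: row0[name] / stddev[name] for name in row0}

for name, value in scaled.items():
    print(name, round(value, 4))
```

The results agree with the leading entries of `scaledfeatures` for the first row in the output above (about 5.1603 for `cement` and 7.4443 for `water`).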


7\. Create an instance of [`Pipeline`](https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline) called `pl`

-  Its `stages` should include the SQLTransformer, VectorAssembler and StandardScaler.

In [22]:
from pyspark.ml import Pipeline

In [23]:
pl = Pipeline(stages=[st, assembler,ss])

8\. Use the `Pipeline` to transform the data and obtain a new dataframe, `transformed`

In [24]:
transformed = pl.fit(train).transform(train)

Inspect the `scaledfeatures` and `features` columns in the transformed dataset.

In [25]:
transformed.select("features","scaledfeatures").limit(5).collect()

[Row(features=DenseVector([540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 1.0]), scaledfeatures=DenseVector([5.1603, 0.0, 0.0, 7.4443, 0.405, 13.7743, 8.3062, 1.1955])),
 Row(features=DenseVector([540.0, 0.0, 0.0, 162.0, 2.5, 1055.0, 676.0, 1.0]), scaledfeatures=DenseVector([5.1603, 0.0, 0.0, 7.4443, 0.405, 13.973, 8.3062, 1.1955])),
 Row(features=DenseVector([332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, 4.0]), scaledfeatures=DenseVector([3.1774, 1.6605, 0.0, 10.4771, 0.0, 12.3439, 7.2986, 4.7821])),
 Row(features=DenseVector([332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, 4.0]), scaledfeatures=DenseVector([3.1774, 1.6605, 0.0, 10.4771, 0.0, 12.3439, 7.2986, 4.7821])),
 Row(features=DenseVector([198.6, 132.4, 0.0, 192.0, 0.0, 978.4, 825.5, 4.0]), scaledfeatures=DenseVector([1.8979, 1.5428, 0.0, 8.8228, 0.0, 12.9584, 10.1431, 4.7821]))]

## Part 2. Using Decision Tree Classifiers to Predict Car Value

During this exercise, you will build a Spark ML Pipeline to encode categorical string variables as integers before using them to build a model. 

The data used for this exercise concerns various properties of cars and whether or not these cars were classified as a good value. The target variable to be predicted is `acceptability`, a categorical variable representing whether or not a car is considered acceptable for purchase. All other feature variables are also **categorical**.

#|Field|Data Type |Description
--|--|--|--
0|buying|String|Based on selling price
1|maint|String|Based on cost to maintain the vehicle
2|doors|String|Number of doors
3|persons|String|Passenger capacity
4|lug_boot|String|Based on luggage boot size
5|safety|String|Based on estimated safety of the vehicle
6|acceptability|String|Based on overall acceptability of the vehicle

10\. Begin by importing the necessary modules for this exercise.

If you’re using Scala, use this code:
```
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer,VectorAssembler}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
```
If you’re using PySpark, use this code:
```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
```

In [26]:
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

1\. Sample the first few lines of `cars_train.csv` using Linux command(s) to help you decide how to handle this file.

In [27]:
# your path may vary, default is cars_train.csv
!head cars_train.csv

vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc
vhigh,vhigh,2,4,small,low,unacc


2\. Load the `cars_train.csv` file into a DataFrame named `train_df`. Verify its schema & content

In [28]:
schema_str="""
    buying string, maint string, doors string, persons string, 
    lug_boot string, safety string, acceptability string
"""
train_df = spark.read.csv("cars_train.csv",schema=schema_str)

In [29]:
train_df.printSchema()

root
 |-- buying: string (nullable = true)
 |-- maint: string (nullable = true)
 |-- doors: string (nullable = true)
 |-- persons: string (nullable = true)
 |-- lug_boot: string (nullable = true)
 |-- safety: string (nullable = true)
 |-- acceptability: string (nullable = true)



3\. Display the first rows of `train_df` to verify its content.

In [30]:
train_df.show()

+------+-----+-----+-------+--------+------+-------------+
|buying|maint|doors|persons|lug_boot|safety|acceptability|
+------+-----+-----+-------+--------+------+-------------+
| vhigh|vhigh|    2|      2|   small|   low|        unacc|
| vhigh|vhigh|    2|      2|   small|   med|        unacc|
| vhigh|vhigh|    2|      2|   small|  high|        unacc|
| vhigh|vhigh|    2|      2|     med|   low|        unacc|
| vhigh|vhigh|    2|      2|     med|   med|        unacc|
| vhigh|vhigh|    2|      2|     med|  high|        unacc|
| vhigh|vhigh|    2|      2|     big|   low|        unacc|
| vhigh|vhigh|    2|      2|     big|   med|        unacc|
| vhigh|vhigh|    2|      2|     big|  high|        unacc|
| vhigh|vhigh|    2|      4|   small|   low|        unacc|
| vhigh|vhigh|    2|      4|   small|   med|        unacc|
| vhigh|vhigh|    2|      4|   small|  high|        unacc|
| vhigh|vhigh|    2|      4|     med|   low|        unacc|
| vhigh|vhigh|    2|      4|     med|   med|        unac

4\. Explore the `buying`, `doors`,`persons`,`acceptability` columns by showing their distinct values. What kinds of columns are these?

In [31]:
train_df.select("buying").distinct().show()

+------+
|buying|
+------+
|   low|
| vhigh|
|   med|
|  high|
+------+



In [32]:
train_df.select("doors").distinct().show()

+-----+
|doors|
+-----+
|    3|
|    4|
|5more|
|    2|
+-----+



In [33]:
train_df.select("persons").distinct().show()

+-------+
|persons|
+-------+
|   more|
|      4|
|      2|
+-------+



In [34]:
train_df.select("acceptability").distinct().show()

+-------------+
|acceptability|
+-------------+
|        unacc|
|          acc|
|        vgood|
|         good|
+-------------+



These are categorical values.

5\. Create a new [`StringIndexer`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) for each of the columns, with the output column name in the form of `[colname]_ix` (for example, `buying` becomes `buying_ix`). Save these seven StringIndexers as `si1`, `si2`, `si3`, and so on.

**Note**: the default sort order of `StringIndexer` is `frequencyDesc` (the most frequent label gets index 0); the other options are `frequencyAsc`, `alphabetDesc`, and `alphabetAsc`. See the link above and click source for more details.

In [35]:
si1 = StringIndexer(inputCol='buying',outputCol='buying_ix')
si2 = StringIndexer(inputCol='maint',outputCol='maint_ix')
si3 = StringIndexer(inputCol='doors',outputCol='doors_ix')
si4 = StringIndexer(inputCol='persons',outputCol='persons_ix')
si5 = StringIndexer(inputCol='lug_boot',outputCol='lug_boot_ix')
si6 = StringIndexer(inputCol='safety',outputCol='safety_ix')
si7 = StringIndexer(inputCol='acceptability',outputCol='acceptability_ix')
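
To make the `frequencyDesc` ordering concrete, here is a plain-Python sketch of how labels could be mapped to indices by descending frequency. It is not Spark's implementation: the `frequency_desc_index` helper and the toy label list are hypothetical, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy data, not the real cars_train.csv distribution
toy_labels = ["unacc", "unacc", "unacc", "acc", "acc", "good"]
mapping = frequency_desc_index(toy_labels)
print(mapping)  # {'unacc': 0.0, 'acc': 1.0, 'good': 2.0}
```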

6\. Next, create a `VectorAssembler` called `va` to assemble each of the indexed columns **except `acceptability_ix`** into a new column called `features`.



In [36]:
indexedcols = ["buying_ix","maint_ix","doors_ix","persons_ix","lug_boot_ix","safety_ix"]

In [37]:
va = VectorAssembler(inputCols=indexedcols,outputCol="features")

7\. Create a `DecisionTreeClassifier` 

- the label column should be `acceptability_ix`
- the features column should be `features`. 

In [38]:
dt = DecisionTreeClassifier(featuresCol='features',labelCol='acceptability_ix')

8\. Create a new Spark ML `Pipeline` called `pl` 

- the `stages` should include all of the `StringIndexer`s, the `VectorAssembler`, and the `DecisionTreeClassifier`.

In [39]:
pl = Pipeline(stages=[si1,si2, si3, si4, si5, si6, si7, va, dt])

9\. Create a PipelineModel named `plmodel` by fitting the pipeline on `train_df`.

In [40]:
plmodel = pl.fit(train_df)

10\. Create a new DataFrame called `test_df` from the `cars_test.csv` dataset. 

In [41]:
test_df = spark.read.csv("cars_test.csv",schema=schema_str)

12\. Apply the learned model to `test_df` and save the resulting DataFrame as `predictions`.

- How many of the first 15 values in the `prediction` column match the values in the `acceptability_ix` column?  


In [42]:
predictions = plmodel.transform(test_df)

In [43]:
predictions.limit(15).toPandas()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability,buying_ix,maint_ix,doors_ix,persons_ix,lug_boot_ix,safety_ix,acceptability_ix,features,rawPrediction,probability,prediction
0,low,high,5more,4,big,low,unacc,3.0,1.0,3.0,1.0,2.0,2.0,0.0,"[3.0, 1.0, 3.0, 1.0, 2.0, 2.0]","[332.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0]",0.0
1,low,high,5more,4,big,med,acc,3.0,1.0,3.0,1.0,2.0,1.0,1.0,"[3.0, 1.0, 3.0, 1.0, 2.0, 1.0]","[9.0, 122.0, 23.0, 0.0]","[0.0584415584416, 0.792207792208, 0.1493506493...",1.0
2,low,high,5more,4,big,high,vgood,3.0,1.0,3.0,1.0,2.0,0.0,2.0,"[3.0, 1.0, 3.0, 1.0, 2.0, 0.0]","[9.0, 122.0, 23.0, 0.0]","[0.0584415584416, 0.792207792208, 0.1493506493...",1.0
3,low,high,5more,more,small,low,unacc,3.0,1.0,3.0,2.0,0.0,2.0,0.0,"[3.0, 1.0, 3.0, 2.0, 0.0, 2.0]","[332.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0]",0.0
4,low,high,5more,more,small,med,acc,3.0,1.0,3.0,2.0,0.0,1.0,1.0,"[3.0, 1.0, 3.0, 2.0, 0.0, 1.0]","[31.0, 47.0, 0.0, 0.0]","[0.397435897436, 0.602564102564, 0.0, 0.0]",1.0
5,low,high,5more,more,small,high,acc,3.0,1.0,3.0,2.0,0.0,0.0,1.0,"[3.0, 1.0, 3.0, 2.0, 0.0, 0.0]","[31.0, 47.0, 0.0, 0.0]","[0.397435897436, 0.602564102564, 0.0, 0.0]",1.0
6,low,high,5more,more,med,low,unacc,3.0,1.0,3.0,2.0,1.0,2.0,0.0,"[3.0, 1.0, 3.0, 2.0, 1.0, 2.0]","[332.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0]",0.0
7,low,high,5more,more,med,med,acc,3.0,1.0,3.0,2.0,1.0,1.0,1.0,"[3.0, 1.0, 3.0, 2.0, 1.0, 1.0]","[9.0, 122.0, 23.0, 0.0]","[0.0584415584416, 0.792207792208, 0.1493506493...",1.0
8,low,high,5more,more,med,high,vgood,3.0,1.0,3.0,2.0,1.0,0.0,2.0,"[3.0, 1.0, 3.0, 2.0, 1.0, 0.0]","[9.0, 122.0, 23.0, 0.0]","[0.0584415584416, 0.792207792208, 0.1493506493...",1.0
9,low,high,5more,more,big,low,unacc,3.0,1.0,3.0,2.0,2.0,2.0,0.0,"[3.0, 1.0, 3.0, 2.0, 2.0, 2.0]","[332.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0]",0.0
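
The numbers in this output are consistent with `rawPrediction` holding per-class counts for the tree leaf a row falls into, `probability` being those counts normalized to sum to 1, and `prediction` being the index of the largest probability. A short plain-Python check on row 1 (variable names are illustrative, and this is not Spark code):

```python
# rawPrediction for row 1, copied from the output above
raw_prediction = [9.0, 122.0, 23.0, 0.0]

# Normalize the counts so they sum to 1
total = sum(raw_prediction)
probability = [count / total for count in raw_prediction]

# The predicted class is the index with the highest probability
predicted_class = float(probability.index(max(probability)))

print([round(p, 4) for p in probability])  # [0.0584, 0.7922, 0.1494, 0.0]
print(predicted_class)                     # 1.0
```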


In [44]:
predictions.select("acceptability", "acceptability_ix", "prediction").limit(15).toPandas()

Unnamed: 0,acceptability,acceptability_ix,prediction
0,unacc,0.0,0.0
1,acc,1.0,1.0
2,vgood,2.0,1.0
3,unacc,0.0,0.0
4,acc,1.0,1.0
5,acc,1.0,1.0
6,unacc,0.0,0.0
7,acc,1.0,1.0
8,vgood,2.0,1.0
9,unacc,0.0,0.0


13\. Use `MulticlassClassificationEvaluator` to evaluate the predictions on the `accuracy` metric.

In [45]:
e = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='acceptability_ix')

In [46]:
e.evaluate(predictions,{e.metricName: "accuracy"})

0.7236842105263158
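
Accuracy is the fraction of rows whose `prediction` equals the label. As a rough illustration, the sketch below computes it in plain Python for only the ten rows displayed earlier, so the result differs from the full test-set figure above. The variable names are illustrative.

```python
# acceptability_ix and prediction for the ten rows displayed earlier
shown_labels      = [0.0, 1.0, 2.0, 0.0, 1.0, 1.0, 0.0, 1.0, 2.0, 0.0]
shown_predictions = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]

# Count the rows where the predicted index equals the label index
matches = sum(1 for y, y_hat in zip(shown_labels, shown_predictions) if y == y_hat)
accuracy = matches / len(shown_labels)

print(matches, accuracy)  # 8 0.8
```

The two misses are the `vgood` rows, which the tree predicts as class 1.0 (`acc`).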