# Predict Low Birth Weight Cases for Newborn Babies

Dataset Source: Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression, 3rd ed., New York: Wiley

This dataset is also part of the aplore3 R package.

## Table of Contents
- [Load Libraries](#load_libraries)
- [Access Data](#access_data)
- [Split Data into Training and Test Set](#training_test)
- [Build Logistic Regression Model](#build_model)
- [Logistic Regression Predictions for Test Data](#test_data)
- [Evaluate Logistic Regression Model](#evaluate_model)
- [Build Naive Bayes Model](#build_model_2)
- [Naive Bayes Predictions for Test Data](#test_data_2)
- [Evaluate Naive Bayes Model](#evaluate_model_2)
- [Build Decision Tree Model](#build_model_3)
- [Decision Tree Predictions for Test Data](#test_data_3)
- [Evaluate Decision Tree Model](#evaluate_model_3)

<a id="load_libraries"></a>
## Load Libraries

The Spark and Python libraries that you need are preinstalled in the notebook environment and only need to be loaded.

Run the following cell to load the libraries you will work with in this notebook:

In [256]:
# PySpark ML pipeline, classifiers, and feature transformers
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes, DecisionTreeClassifier
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import Row, SQLContext

# Confusion-matrix based metrics: precision, recall, F-measure
from pyspark.mllib.evaluation import MulticlassMetrics
# Area under the ROC curve and area under the precision-recall curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Create the SQLContext from the existing SparkContext `sc`
sqlContext = SQLContext(sc)

In [257]:
# Each line of the CSV file is split into an array of fields.
# This is the summary of the data structure, including the column position and name.
# The first field is at position 0.

# 0 ID      -  Identification code
# 1 LOW     -  Low birth weight (0: >= 2500 g, 1: < 2500 g), target variable
# 2 AGE     -  Mother's age in years
# 3 RACE    -  Race (1: White, 2: Black, 3: Other)
# 4 SMOKE   -  Smoking status during pregnancy (0: No, 1: Yes)
# 5 PTL     -  History of premature labor (0: None, 1: One, 2: Two, etc.)
# 6 HT      -  History of hypertension (0: No, 1: Yes)
# 7 UI      -  Presence of uterine irritability (0: No, 1: Yes)
# 8 FTV     -  Number of physician visits during the first trimester (0: None, 1: One, 2: Two, etc.)

# "label" is the target variable. "PersonInfo" holds the independent variables (every column except the unique identifier and the target).

LabeledDocument = Row("ID", "PersonInfo", "label")

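# Define a function that parses one line of the raw CSV file and returns a LabeledDocument

def parseDocument(line):
    values = [str(x) for x in line.split(',')]
    # String comparison: '1' > '0' gives label 1.0 and '0' gives label 0.0.
    # The header token 'LOW' also compares greater than '0'.
    if values[1] > '0':
        LOW = 1.0
    else:
        LOW = 0.0

    # Join the predictors into one text field for the Tokenizer.
    # There is no separator between SMOKE and PTL or between HT and UI,
    # so each of those pairs becomes a single token (for example '00' or '01').
    textValue = str(values[2]) + " " + str(values[3])+ " " + str(values[4]) + str(values[5])+ " " + str(values[6]) + str(values[7])+ " " + str(values[8])
    return LabeledDocument(values[0], textValue, LOW)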
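
To see what `parseDocument` produces, the following sketch applies the same logic to single lines of the file. It is a standalone illustration: `LabeledDocumentSketch` and `parse_document_sketch` are hypothetical names, and a `namedtuple` stands in for the Spark `Row` so the example runs without a Spark session.

```python
from collections import namedtuple

# Hypothetical stand-in for the Spark Row used in the notebook
LabeledDocumentSketch = namedtuple("LabeledDocumentSketch", ["ID", "PersonInfo", "label"])

def parse_document_sketch(line):
    values = line.split(',')
    # Same string comparison as parseDocument
    label = 1.0 if values[1] > '0' else 0.0
    # Same concatenation as parseDocument, including the missing separators
    text = (values[2] + " " + values[3] + " " + values[4] + values[5] + " "
            + values[6] + values[7] + " " + values[8])
    return LabeledDocumentSketch(values[0], text, label)

doc = parse_document_sketch("85,0,19,2,0,0,0,1,0")
header = parse_document_sketch("ID,LOW,AGE,RACE,SMOKE,PTL,HT,UI,FTV")
print(doc)
print(header)
```

The second call shows why a header line, if it is not filtered out, ends up as a record with label 1.0.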

<a id="access_data"></a>
## Access Data
Before you can access the data file in Object Storage, you must set up the Spark configuration with your Object Storage credentials.

To do this, click on the cell below and select the **Insert to code > Insert Spark Session DataFrame** function from the Files tab below the data file you want to work with.

<div class="alert alert-block alert-info">The following code contains the credentials for a file in your IBM Cloud Object Storage. Delete the code starting from `from pyspark.sql import SparkSession` line before you run the cell.</div>

In [258]:
# Object Storage Credentials
import ibmos2spark

# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'api_key': 'lue6np0RwcvhARfJIQNtvVTUt3I45m5qk9UHM6LFTc3B',
    'service_id': 'iam-ServiceId-3ac1bdb0-c8a1-4ae8-abe0-2dd53ae4c6e9',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token'}

configuration_name = 'os_9d550c7f6655453f915511631f5b077f_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

Now let's load the data into a Spark RDD and print the number of rows and the first 5 rows.

Each project you create has a bucket in your object storage. You can get the bucket name from the project Settings page. In the cell below, replace the bucket name passed as the second argument of `cos.url` with your own bucket name.

In [259]:
data = sc.textFile(cos.url('lowbwt.csv', 'assignment331c5c83422a7434ab113135bf62f5fb0'))
print "Total records in the data set:", data.count()
print "The first 5 rows"
data.take(5)

Total records in the data set: 190
The first 5 rows


[u'ID,LOW,AGE,RACE,SMOKE,PTL,HT,UI,FTV',
 u'85,0,19,2,0,0,0,1,0',
 u'86,0,33,3,0,0,0,0,3',
 u'87,0,20,1,1,0,0,0,1',
 u'88,0,21,1,1,0,0,1,2']

Create a DataFrame from the RDD

In [260]:
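# Parse each line with parseDocument and convert the result to a DataFrame.
# The header line does not contain "Name", so this filter keeps it and the
# header is parsed as a record (it appears as the first row in the output).
documents = data.filter(lambda s: "Name" not in s).map(parseDocument)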
lowbwtData = documents.toDF() # ToDataFrame
print "Number of records: " + str(lowbwtData.count())
print "First 5 records: "
lowbwtData.take(5)

Number of records: 190
First 5 records: 


[Row(ID=u'ID', PersonInfo=u'AGE RACE SMOKEPTL HTUI FTV', label=1.0),
 Row(ID=u'85', PersonInfo=u'19 2 00 01 0', label=0.0),
 Row(ID=u'86', PersonInfo=u'33 3 00 00 3', label=0.0),
 Row(ID=u'87', PersonInfo=u'20 1 10 00 1', label=0.0),
 Row(ID=u'88', PersonInfo=u'21 1 10 01 2', label=0.0)]

<a id="training_test"></a>
## Split Data into Training and Test Set

We divide the data into a training set and a test set. The training set is used to build the model, and the test set is used to evaluate how well the model performs on data it has not seen.

In [261]:
# Divide the data into training and test set, with random seed to reproduce results
(train, test) = lowbwtData.randomSplit([0.6, 0.4], seed = 123)
print "Number of records in the training set: " + str(train.count())
print "Number of records in the test set: " + str(test.count())
# Output first 20 records in the training set
print "First 20 records in the training set: "
train.show()

Number of records in the training set: 114
Number of records in the test set: 76
First 20 records in the training set: 
+---+------------+-----+
| ID|  PersonInfo|label|
+---+------------+-----+
|100|18 1 10 00 0|  0.0|
|104|20 3 00 01 0|  0.0|
|105|28 1 10 00 1|  0.0|
|106|32 3 00 00 2|  0.0|
|107|31 1 00 01 3|  0.0|
|109|28 3 00 00 0|  0.0|
|112|28 1 00 00 0|  0.0|
|113|17 1 10 00 0|  0.0|
|114|29 1 00 00 2|  0.0|
|115|26 2 10 00 0|  0.0|
|119|35 2 11 00 1|  0.0|
|120|25 1 00 00 1|  0.0|
|123|29 1 10 00 2|  0.0|
|124|19 1 10 00 2|  0.0|
|127|33 1 10 00 1|  0.0|
|128|21 2 10 00 2|  0.0|
|130|23 2 00 00 1|  0.0|
|132|18 1 10 01 0|  0.0|
|133|18 1 10 01 0|  0.0|
|135|19 3 00 00 0|  0.0|
+---+------------+-----+
only showing top 20 rows
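
The idea behind a seeded random split can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: `random_split_sketch` is a hypothetical helper that draws one uniform number per row and assigns the row to a split according to the normalized weights. Because each row is assigned independently, the split sizes only approximate the requested 60/40 proportions.

```python
import random

def random_split_sketch(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any rounding error in the bounds
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(190))
train_rows, test_rows = random_split_sketch(rows, [0.6, 0.4], seed=123)
print(len(train_rows), len(test_rows))
```

Running the function again with the same seed returns the same split, which is the purpose of the `seed` argument in the cell above.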



<a id="build_model"></a>
## Build Logistic Regression Model

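We use a Spark ML `Pipeline` to build the Logistic Regression model. The `Tokenizer` splits `PersonInfo` into tokens, `HashingTF` hashes the tokens into a feature vector, and `LogisticRegression` is fit on those features.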

In [262]:
# set up Logistic Regression using Pipeline of SparkML
tokenizer = Tokenizer(inputCol="PersonInfo", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=50, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

In [263]:
# Fit the Logistic Regression pipeline on the training set
# The stages are executed in order
model = pipeline.fit(train)
#[stage.coefficients for stage in model.stages if hasattr(stage, "coefficients")]
# model.stages[2].intercept
model.stages[2].coefficients

SparseVector(262144, {8227: -0.8026, 18127: -1.8382, 18659: -2.7236, 24537: -0.3984, 31351: 3.2539, 37812: 4.4797, 64358: 2.2765, 69821: -3.0029, 83214: 0.0322, 89074: 1.4235, 98627: -3.8517, 109681: 0.3086, 110466: 8.547, 112272: 0.8523, 128319: 0.8382, 139093: 0.1271, 146429: -5.4366, 147946: -2.9156, 175329: -2.6536, 177493: 2.6024, 187043: 1.6197, 207020: -1.8509, 212053: 0.5704, 213217: -10.6018, 219381: -0.0939, 233878: -5.2583, 236232: -0.0424, 242525: -1.9885, 250051: -3.8739, 250733: -3.5146, 250802: 1.7828, 252551: -7.0663, 257339: 8.6457, 259362: 0.5064, 259523: 0.8185})
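
The coefficient vector has 262144 entries because `HashingTF` uses 2^18 = 262144 features by default, and each token is mapped to one index by a hash function. The sketch below illustrates the idea in plain Python. It is an assumption-laden stand-in: `hashing_tf_sketch` is a hypothetical helper and it uses `zlib.crc32` as the hash, whereas Spark uses MurmurHash3, so the indices will not match the ones shown above.

```python
import zlib
from collections import Counter

def hashing_tf_sketch(text, num_features=262144):
    """Map a whitespace-tokenized string to a sparse {index: count} dict."""
    # Like the Tokenizer stage: lowercase, then split on whitespace
    tokens = text.lower().split()
    counts = Counter()
    for token in tokens:
        # Stand-in hash; Spark's HashingTF uses MurmurHash3 instead
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

features = hashing_tf_sketch("19 2 00 01 0")
small = hashing_tf_sketch("19 2 00 01 0", num_features=32)
print(features)
print(small)
```

With a small feature space such as `numFeatures=32`, different tokens are more likely to collide on the same index, in which case their counts are added together.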

<a id="test_data"></a>
## Logistic Regression Predictions for Test Data

In [264]:
# Make predictions on test documents and print columns of interest
prediction = model.transform(test)
selected = prediction.select("PersonInfo", "prediction", "probability")
for row in selected.collect():
    print row
#for row in prediction.collect():
#    print row

Row(PersonInfo=u'18 1 10 00 0', prediction=0.0, probability=DenseVector([0.9974, 0.0026]))
Row(PersonInfo=u'15 2 00 00 0', prediction=0.0, probability=DenseVector([0.9904, 0.0096]))
Row(PersonInfo=u'25 1 10 00 3', prediction=1.0, probability=DenseVector([0.084, 0.916]))
Row(PersonInfo=u'36 1 00 00 1', prediction=0.0, probability=DenseVector([0.9889, 0.0111]))
Row(PersonInfo=u'25 3 00 01 2', prediction=1.0, probability=DenseVector([0.0458, 0.9542]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.8315, 0.1685]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.8315, 0.1685]))
Row(PersonInfo=u'24 1 11 00 1', prediction=1.0, probability=DenseVector([0.0004, 0.9996]))
Row(PersonInfo=u'25 2 00 00 0', prediction=0.0, probability=DenseVector([0.884, 0.116]))
Row(PersonInfo=u'27 1 10 00 0', prediction=1.0, probability=DenseVector([0.0022, 0.9978]))
Row(PersonInfo=u'31 1 10 00 2', prediction=0.0, probability=DenseVector([0.9817, 0.0183]))
Row

In [265]:
#Tabulate the predicted outcome
prediction.select("prediction").groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |56   |
|1.0       |20   |
+----------+-----+



In [266]:
#Tabulate the actual outcome
prediction.select("label").groupBy("label").count().show(truncate=False)

+-----+-----+
|label|count|
+-----+-----+
|0.0  |53   |
|1.0  |23   |
+-----+-----+



In [267]:
# Crosstab of the actual label (rows) against the predicted label (columns). It shows:
# 1. The number of low birth weight infants predicted as having low birth weight
# 2. The number of low birth weight infants predicted as not having low birth weight
# 3. The number of regular birth weight infants predicted as having low birth weight
# 4. The number of regular birth weight infants predicted as not having low birth weight

prediction.crosstab('label', 'prediction').show()

+----------------+---+---+
|label_prediction|0.0|1.0|
+----------------+---+---+
|             1.0| 15|  8|
|             0.0| 41| 12|
+----------------+---+---+



<a id="evaluate_model"></a>
## Evaluate Logistic Regression Model

We evaluate the model on the training set and on the test set. Training metrics show how well the model fits the data it was built from, while test metrics estimate its predictive accuracy on new data.

In [268]:
# Evaluate the Logistic Regression model on a training set
# Select (prediction, true label) and compute training error
pred_lr=model.transform(train).select("prediction", "label")
eval_lr=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_lr=eval_lr.evaluate(pred_lr)
# create RDD
predictionAndLabels_lr=pred_lr.rdd
metrics_lr=MulticlassMetrics(predictionAndLabels_lr)
precision_lr=metrics_lr.precision(1.0)
recall_lr=metrics_lr.recall(1.0)
f1Measure_lr = metrics_lr.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_lr)
print ("Training Accuracy = %s" %accuracy_lr)
print ("Training Error = %s" % (1-accuracy_lr))
print ("Precision = %s" %precision_lr)
print ("Recall = %s" %recall_lr)

F1 Measure = 0.782608695652
Training Accuracy = 0.868421052632
Training Error = 0.131578947368
Precision = 0.84375
Recall = 0.72972972973


In [269]:
# Evaluate the Logistic Regression model on a test set
# Select (prediction, true label) and compute test error
pred_lr=model.transform(test).select("prediction", "label")
eval_lr=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_lr=eval_lr.evaluate(pred_lr)
# create RDD
predictionAndLabels_lr=pred_lr.rdd
metrics_lr=MulticlassMetrics(predictionAndLabels_lr)
precision_lr=metrics_lr.precision(1.0)
recall_lr=metrics_lr.recall(1.0)
f1Measure_lr = metrics_lr.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_lr)
print ("Test Accuracy = %s" %accuracy_lr)
print ("Test Error = %s" % (1-accuracy_lr))
print ("Precision = %s" %precision_lr)
print ("Recall = %s" %recall_lr)

F1 Measure = 0.372093023256
Test Accuracy = 0.644736842105
Test Error = 0.355263157895
Precision = 0.4
Recall = 0.347826086957
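
The test metrics above can be checked by hand against the test-set crosstab from the predictions section, where rows are actual labels and columns are predicted labels. The sketch below is plain Python, and the names `tp`, `fn`, `fp`, and `tn` are introduced here for illustration only.

```python
# Counts read from the test-set crosstab for logistic regression
tp = 8    # label 1.0, prediction 1.0
fn = 15   # label 1.0, prediction 0.0
fp = 12   # label 0.0, prediction 1.0
tn = 41   # label 0.0, prediction 0.0

precision = tp / float(tp + fp)   # share of predicted positives that are correct
recall = tp / float(tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + fn + fp + tn)

print("F1 Measure = %s" % f1)
print("Test Accuracy = %s" % accuracy)
print("Precision = %s" % precision)
print("Recall = %s" % recall)
```

The values agree with the output of the evaluation cell, which confirms that precision and recall are reported for the low birth weight class (label 1.0).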


In [270]:
bin_lr=BinaryClassificationMetrics(predictionAndLabels_lr)

# Area under precision-recall curve
print("Area under PR = %s" % bin_lr.areaUnderPR)
# Area under ROC curve
print("Area under ROC = %s" % bin_lr.areaUnderROC)

Area under PR = 0.472597254005
Area under ROC = 0.560705496308
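
Because `predictionAndLabels_lr` holds hard 0/1 predictions rather than probability scores, the ROC curve here has only one interior point. The area under it then reduces to the mean of the true positive rate and the true negative rate. The sketch below checks this against the value above, using counts read from the test-set crosstab; the variable names are illustrative.

```python
# Counts read from the test-set crosstab for logistic regression
tp, fn, fp, tn = 8, 15, 12, 41

tpr = tp / float(tp + fn)  # true positive rate (recall)
fpr = fp / float(fp + tn)  # false positive rate

# Trapezoid rule over the ROC points (0, 0), (fpr, tpr), (1, 1)
auc = 0.5 * fpr * tpr + 0.5 * (1.0 - fpr) * (tpr + 1.0)
# Equivalent closed form: mean of TPR and TNR
auc_closed_form = (tpr + (1.0 - fpr)) / 2.0

print("Area under ROC = %s" % auc)
```

To obtain a curve with more than one threshold, the metric would need a continuous score for the positive class instead of the predicted label. One option, offered here as a suggestion rather than as part of the original analysis, is `BinaryClassificationEvaluator` from `pyspark.ml.evaluation`, which reads the `rawPrediction` column by default.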


<a id="build_model_2"></a>
## Build Naive Bayes Model

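We use a Spark ML `Pipeline` to build the Naive Bayes model. The stages are the same as before, except that `HashingTF` is limited to 32 features and the final stage is a multinomial `NaiveBayes` classifier.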

In [271]:
# set up Naive Bayes using Pipeline of SparkML
tokenizer = Tokenizer(inputCol="PersonInfo", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=32)
nb = NaiveBayes(labelCol="label", featuresCol="features", predictionCol="prediction", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[tokenizer, hashingTF, nb])

In [272]:
# Fit the Naive Bayes pipeline on the training set
# The stages are executed in order
model = pipeline.fit(train)
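
As background for the `smoothing=1.0` and `modelType="multinomial"` settings, here is a minimal textbook sketch of multinomial Naive Bayes with additive smoothing in plain Python. It is not Spark's implementation, and details such as how the class prior is estimated may differ. The function names and the toy count vectors are made up for illustration and are not the birth weight data.

```python
import math

def train_multinomial_nb(X, y, smoothing=1.0):
    """X: list of count vectors, y: list of class labels."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior = {}
    log_theta = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / float(len(X)))
        # Total count of each feature within class c
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        # Additive smoothing keeps unseen features from getting probability 0
        denom = sum(totals) + smoothing * n_features
        log_theta[c] = [math.log((t + smoothing) / denom) for t in totals]
    return log_prior, log_theta

def predict_multinomial_nb(model, x):
    log_prior, log_theta = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(xi * lt for xi, lt in zip(x, log_theta[c]))
    # Pick the class with the highest log posterior score
    return max(scores, key=scores.get)

# Toy count vectors (made up for illustration)
X = [[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]]
y = [0.0, 0.0, 1.0, 1.0]
toy_model = train_multinomial_nb(X, y, smoothing=1.0)
pred_a = predict_multinomial_nb(toy_model, [1, 0, 1])
pred_b = predict_multinomial_nb(toy_model, [0, 2, 0])
print(pred_a, pred_b)
```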

<a id="test_data_2"></a>
## Naive Bayes Predictions for Test Data

In [273]:
# Make predictions on test documents and print columns of interest
prediction = model.transform(test)
selected = prediction.select("PersonInfo", "prediction", "probability")
for row in selected.collect():
    print row
#for row in prediction.collect():
#    print row

Row(PersonInfo=u'18 1 10 00 0', prediction=0.0, probability=DenseVector([0.8637, 0.1363]))
Row(PersonInfo=u'15 2 00 00 0', prediction=0.0, probability=DenseVector([0.7652, 0.2348]))
Row(PersonInfo=u'25 1 10 00 3', prediction=0.0, probability=DenseVector([0.7839, 0.2161]))
Row(PersonInfo=u'36 1 00 00 1', prediction=0.0, probability=DenseVector([0.8827, 0.1173]))
Row(PersonInfo=u'25 3 00 01 2', prediction=0.0, probability=DenseVector([0.6153, 0.3847]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.8209, 0.1791]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.8209, 0.1791]))
Row(PersonInfo=u'24 1 11 00 1', prediction=0.0, probability=DenseVector([0.5169, 0.4831]))
Row(PersonInfo=u'25 2 00 00 0', prediction=0.0, probability=DenseVector([0.8697, 0.1303]))
Row(PersonInfo=u'27 1 10 00 0', prediction=0.0, probability=DenseVector([0.6788, 0.3212]))
Row(PersonInfo=u'31 1 10 00 2', prediction=0.0, probability=DenseVector([0.7228, 0.2772]))

In [274]:
#Tabulate the predicted outcome
prediction.select("prediction").groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |66   |
|1.0       |10   |
+----------+-----+



In [275]:
#Tabulate the actual outcome
prediction.select("label").groupBy("label").count().show(truncate=False)

+-----+-----+
|label|count|
+-----+-----+
|0.0  |53   |
|1.0  |23   |
+-----+-----+



In [276]:
# Crosstab of the actual label (rows) against the predicted label (columns). It shows:
# 1. The number of low birth weight infants predicted as having low birth weight
# 2. The number of low birth weight infants predicted as not having low birth weight
# 3. The number of regular birth weight infants predicted as having low birth weight
# 4. The number of regular birth weight infants predicted as not having low birth weight

prediction.crosstab('label', 'prediction').show()

+----------------+---+---+
|label_prediction|0.0|1.0|
+----------------+---+---+
|             1.0| 16|  7|
|             0.0| 50|  3|
+----------------+---+---+



<a id="evaluate_model_2"></a>
## Evaluate Naive Bayes Model

We evaluate the model on the training set and on the test set. Training metrics show how well the model fits the data it was built from, while test metrics estimate its predictive accuracy on new data.

In [277]:
# Evaluate the Naive Bayes model on a training set
# Select (prediction, true label) and compute training error
pred_nb=model.transform(train).select("prediction", "label")
eval_nb=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_nb=eval_nb.evaluate(pred_nb)
# create RDD
predictionAndLabels_nb=pred_nb.rdd
metrics_nb=MulticlassMetrics(predictionAndLabels_nb)
precision_nb=metrics_nb.precision(1.0)
recall_nb=metrics_nb.recall(1.0)
f1Measure_nb = metrics_nb.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_nb)
print ("Training Accuracy = %s" %accuracy_nb)
print ("Training Error = %s" % (1-accuracy_nb))
print ("Precision = %s" %precision_nb)
print ("Recall = %s" %recall_nb)

F1 Measure = 0.526315789474
Training Accuracy = 0.763157894737
Training Error = 0.236842105263
Precision = 0.75
Recall = 0.405405405405


In [278]:
# Evaluate the Naive Bayes model on a test set
# Select (prediction, true label) and compute test error
pred_nb=model.transform(test).select("prediction", "label")
eval_nb=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_nb=eval_nb.evaluate(pred_nb)
# create RDD
predictionAndLabels_nb=pred_nb.rdd
metrics_nb=MulticlassMetrics(predictionAndLabels_nb)
precision_nb=metrics_nb.precision(1.0)
recall_nb=metrics_nb.recall(1.0)
f1Measure_nb = metrics_nb.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_nb)
print ("Test Accuracy = %s" %accuracy_nb)
print ("Test Error = %s" % (1-accuracy_nb))
print ("Precision = %s" %precision_nb)
print ("Recall = %s" %recall_nb)

F1 Measure = 0.424242424242
Test Accuracy = 0.75
Test Error = 0.25
Precision = 0.7
Recall = 0.304347826087


In [279]:
bin_nb=BinaryClassificationMetrics(predictionAndLabels_nb)

# Area under precision-recall curve
print("Area under PR = %s" % bin_nb.areaUnderPR)
# Area under ROC curve
print("Area under ROC = %s" % bin_nb.areaUnderROC)

Area under PR = 0.607437070938
Area under ROC = 0.623872026251


<a id="build_model_3"></a>
## Build Decision Tree Model

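We use a Spark ML `Pipeline` to build the Decision Tree model. The tokenizer and the 32-feature `HashingTF` stages are unchanged, and the final stage is a `DecisionTreeClassifier`.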

In [280]:
# set up Decision Tree using Pipeline of SparkML
tokenizer = Tokenizer(inputCol="PersonInfo", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features", numFeatures=32)
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[tokenizer, hashingTF, dt])

In [281]:
# Fit the Decision Tree pipeline on the training set
# The stages are executed in order
model = pipeline.fit(train)
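
`DecisionTreeClassifier` is used here with its default settings, which means splits are chosen by Gini impurity. The sketch below shows, in plain Python, how Gini impurity and the impurity reduction of a candidate split are computed. The helper names and the toy labels are hypothetical and are not taken from the birth weight data.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = float(len(parent))
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy labels (made up for illustration)
parent = [0.0] * 6 + [1.0] * 4
left = [0.0] * 5 + [1.0] * 1
right = [0.0] * 1 + [1.0] * 3

parent_impurity = gini(parent)
gain = split_gain(parent, left, right)
print(parent_impurity, gain)
```

A split with a larger gain separates the two classes better; a node whose labels are all identical has impurity 0.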

<a id="test_data_3"></a>
## Decision Tree Predictions for Test Data

In [282]:
# Make predictions on test documents and print columns of interest
prediction = model.transform(test)
selected = prediction.select("PersonInfo", "prediction", "probability")
for row in selected.collect():
    print row
#for row in prediction.collect():
#    print row

Row(PersonInfo=u'18 1 10 00 0', prediction=0.0, probability=DenseVector([0.7297, 0.2703]))
Row(PersonInfo=u'15 2 00 00 0', prediction=0.0, probability=DenseVector([0.9744, 0.0256]))
Row(PersonInfo=u'25 1 10 00 3', prediction=1.0, probability=DenseVector([0.4667, 0.5333]))
Row(PersonInfo=u'36 1 00 00 1', prediction=0.0, probability=DenseVector([0.9744, 0.0256]))
Row(PersonInfo=u'25 3 00 01 2', prediction=1.0, probability=DenseVector([0.4667, 0.5333]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.9744, 0.0256]))
Row(PersonInfo=u'17 2 00 00 1', prediction=0.0, probability=DenseVector([0.9744, 0.0256]))
Row(PersonInfo=u'24 1 11 00 1', prediction=1.0, probability=DenseVector([0.0, 1.0]))
Row(PersonInfo=u'25 2 00 00 0', prediction=0.0, probability=DenseVector([1.0, 0.0]))
Row(PersonInfo=u'27 1 10 00 0', prediction=0.0, probability=DenseVector([0.7297, 0.2703]))
Row(PersonInfo=u'31 1 10 00 2', prediction=0.0, probability=DenseVector([0.7297, 0.2703]))
Row(PersonI

In [283]:
#Tabulate the predicted outcome
prediction.select("prediction").groupBy("prediction").count().show(truncate=False)

+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |53   |
|1.0       |23   |
+----------+-----+



In [284]:
#Tabulate the actual outcome
prediction.select("label").groupBy("label").count().show(truncate=False)

+-----+-----+
|label|count|
+-----+-----+
|0.0  |53   |
|1.0  |23   |
+-----+-----+



In [285]:
# Crosstab of the actual label (rows) against the predicted label (columns). It shows:
# 1. The number of low birth weight infants predicted as having low birth weight
# 2. The number of low birth weight infants predicted as not having low birth weight
# 3. The number of regular birth weight infants predicted as having low birth weight
# 4. The number of regular birth weight infants predicted as not having low birth weight

prediction.crosstab('label', 'prediction').show()

+----------------+---+---+
|label_prediction|0.0|1.0|
+----------------+---+---+
|             1.0| 14|  9|
|             0.0| 39| 14|
+----------------+---+---+



<a id="evaluate_model_3"></a>
## Evaluate Decision Tree Model

We evaluate the model on the training set and on the test set. Training metrics show how well the model fits the data it was built from, while test metrics estimate its predictive accuracy on new data.

In [286]:
# Evaluate the Decision Tree model on the training set
# Select (prediction, true label) and compute training error
pred_dt=model.transform(train).select("prediction", "label")
eval_dt=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_dt=eval_dt.evaluate(pred_dt)
# create RDD
predictionAndLabels_dt=pred_dt.rdd
metrics_dt=MulticlassMetrics(predictionAndLabels_dt)
precision_dt=metrics_dt.precision(1.0)
recall_dt=metrics_dt.recall(1.0)
f1Measure_dt = metrics_dt.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_dt)
print ("Training Accuracy = %s" %accuracy_dt)
print ("Training Error = %s" % (1-accuracy_dt))
print ("Precision = %s" %precision_dt)
print ("Recall = %s" %recall_dt)

F1 Measure = 0.724637681159
Training Accuracy = 0.833333333333
Training Error = 0.166666666667
Precision = 0.78125
Recall = 0.675675675676


In [287]:
# Evaluate the Decision Tree model on the test set
# Select (prediction, true label) and compute test error
pred_dt=model.transform(test).select("prediction", "label")
eval_dt=MulticlassClassificationEvaluator (
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_dt=eval_dt.evaluate(pred_dt)
# create RDD
predictionAndLabels_dt=pred_dt.rdd
metrics_dt=MulticlassMetrics(predictionAndLabels_dt)
precision_dt=metrics_dt.precision(1.0)
recall_dt=metrics_dt.recall(1.0)
f1Measure_dt = metrics_dt.fMeasure(1.0, 1.0)
print("F1 Measure = %s" % f1Measure_dt)
print ("Test Accuracy = %s" %accuracy_dt)
print ("Test Error = %s" % (1-accuracy_dt))
print ("Precision = %s" %precision_dt)
print ("Recall = %s" %recall_dt)

F1 Measure = 0.391304347826
Test Accuracy = 0.631578947368
Test Error = 0.368421052632
Precision = 0.391304347826
Recall = 0.391304347826


In [288]:
bin_dt=BinaryClassificationMetrics(predictionAndLabels_dt)

# Area under precision-recall curve
print("Area under PR = %s" % bin_dt.areaUnderPR)
# Area under ROC curve
print("Area under ROC = %s" % bin_dt.areaUnderROC)

Area under PR = 0.483409610984
Area under ROC = 0.563576702215
