# Develop a Scala Spark Model on Chicago Building Violations


This notebook models building violations in Chicago.
___________

The data are <a href="https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr"  target="_blank" rel="noopener noreferrer">Violations issued by the Chicago Department of Buildings</a>
 over the period from 2006 until present. The dataset contains instances of `violations`. Each violation is associated with an `inspection` and an `inspection status`. 

Using Spark Machine Learning, we're going to develop a model for the data from 2006-2016 which provides a score in the interval $[0,1]$ for how likely we believe an individual building is to `Pass` or `Fail` an inspection. 

This notebook runs on Scala 2.11 with Spark.
______________

## Table of contents
1. [Wrangle data](#wrangle)
2. [Build a pipeline](#build)
3. [Train the model](#train)
4. [Save the model](#save)
5. [Deploy the model](#deploy)
6. [Summary](#summary)
_________________

<a id="wrangle"></a>
## 1. Wrangle data 

Read the data into a Spark DataFrame. 

1. [Import the libraries](#libraries)
2. [Load the credentials](#credentials)
3. [Load the data set](#dataset)
4. [Modify the data](#modify)

### 1.1 Import the libraries<a id="libraries"></a>

In [1]:
// Import top level
import scala.sys.process._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.util._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions.DateFormatClass
import com.ibm.ibmos2spark.bluemix



### 1.2 Load the credentials<a id='credentials'></a>
We need our credentials to work with Cloud Object Storage (COS) to read the data into a Spark DataFrame. 
To load our credentials and the data set:
1. Go to the <a href="https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr" target="_blank" rel="noopener noreferrer">Violations issued by the Chicago Department of Buildings</a> page. 
2. Click **Export**, then download and save the data set `Building_Violations` as a .csv file to your computer.  
3. Load the .csv file into your notebook. Click the **Data** icon on the notebook action bar. Drop the file into the box or browse to select the file. The file is loaded to your COS and appears in the Data Assets section of the project. For more information, see <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/load-and-access-data.html" target="_blank" rel="noopener noreferrer">Load and access data</a>.
4. Load the credentials for the `Building_Violations.csv` file. Click in the next code cell and select **Insert to code > Insert credentials**.
5. Run the cell.

In [None]:
// Insert to code > Insert credentials.
// @hidden_cell


### 1.3 Load the data set<a id='dataset'></a>

1. Load the data from the `Building_Violations.csv` file into a SparkSession DataFrame. Click in the next code cell and select **Insert to code > Insert SparkSession DataFrame** under the file name.
2. Replace `dfData1` with `violations`.
3. Run the cell.

In [None]:
// @hidden_cell
// SparkSession DataFrame goes here


### 1.4 Modify the data <a id='modify'></a>

We’re not going to build a model on all of the data. We need to separate 2006–2016 from the later records. We will use a decade of data to train the model, and we will test its performance on the data from 2017 onward. 

In the schema of the loaded DataFrame, `VIOLATION DATE` is a string. This means we need to do some wrangling before we can filter by date in an intuitive way.

In [4]:
// Create datetime column
val dated = violations.withColumn("timeStamp", to_date(unix_timestamp(
  $"VIOLATION DATE", "MM/dd/yyyy"
).cast("timestamp")))


dated = [ID: int, VIOLATION LAST MODIFIED DATE: string ... 31 more fields]


[ID: int, VIOLATION LAST MODIFIED DATE: string ... 31 more fields]
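To see what the `MM/dd/yyyy` pattern does independently of Spark, here is a minimal sketch that uses `java.time` from the Java standard library. The object name `ViolationDateSketch` and the sample dates are illustrative assumptions, not part of the notebook, and Spark's own parser may differ in edge cases.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object ViolationDateSketch {
  // Same pattern string that the notebook passes to unix_timestamp
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

  // Parse a date string such as "12/30/2016" into a LocalDate
  def parse(s: String): LocalDate = LocalDate.parse(s, fmt)

  // True when the date falls in the training period (2016 or earlier)
  def isTraining(s: String): Boolean = parse(s).getYear <= 2016
}
```

Once the string is a real date, extracting the year is a single method call, which is the same reason the notebook adds the `timeStamp` column.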

Let’s make some more modifications:

- Rename all of the columns so that we can reference them more easily later
- Replace the whitespace in each column name with an underscore

In [5]:
// Replace whitespace in every column name with `_`
var cleanDf = dated
for(col <- dated.columns){
    cleanDf = cleanDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
}

cleanDf = [ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]


[ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]
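The renaming rule itself is plain string manipulation, so it can be checked without a DataFrame. The sketch below is an assumption-labeled illustration (the object `ColumnNameSketch` is hypothetical) that applies the same regular expression to a list of names without a mutable variable.

```scala
object ColumnNameSketch {
  // Replace every whitespace character in a column name with an underscore
  def clean(name: String): String = name.replaceAll("\\s", "_")

  // Apply the same rule to a whole list of column names
  def cleanAll(names: Seq[String]): Seq[String] = names.map(clean)
}
```

With a DataFrame in scope, passing the cleaned names to `toDF`, for example `dated.toDF(ColumnNameSketch.cleanAll(dated.columns): _*)`, should give the same result as the loop, assuming the column order is unchanged.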

We’re modeling `INSPECTION_STATUS`, but a small number of records have a status that has not been resolved into `PASSED` or `FAILED`. Now we will:
- Select only the `PASSED` and `FAILED` records with `SQLTransformer`
- Change the data type of `LATITUDE` and `LONGITUDE` from string to double

In [6]:
import org.apache.spark.ml.feature.SQLTransformer
val df = new SQLTransformer().setStatement("SELECT * FROM __THIS__ WHERE INSPECTION_STATUS IN ('FAILED', 'PASSED')").transform(cleanDf)
val preppedFrame = df.withColumn("LATITUDE", df("LATITUDE").cast(DoubleType)).
                    withColumn("LONGITUDE", df("LONGITUDE").cast(DoubleType))

df = [ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]
preppedFrame = [ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]


[ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]

Next, separate the data by year:

In [7]:
// Filter by date: train on years up to and including 2016, test on 2017 and later
val trainingData2016 = preppedFrame.filter(year($"timestamp").leq(lit(2016))) 
val testingData2017 = preppedFrame.filter(year($"timestamp").gt(lit(2016))) 

trainingData2016 = [ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]
testingData2017 = [ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]


[ID: int, VIOLATION_LAST_MODIFIED_DATE: string ... 31 more fields]

**Note:** `leq` means less than or equal to, and `gt` means greater than.

The new `timeStamp` field lets us filter by year directly instead of comparing date strings. Let’s peek at a few rows from each split.

In [8]:
// Take a peek
testingData2017.select("VIOLATION_DATE").show(3)

+--------------+
|VIOLATION_DATE|
+--------------+
|    06/12/2019|
|    06/12/2019|
|    06/12/2019|
+--------------+
only showing top 3 rows



In [9]:
// Peek at the training data too
trainingData2016.select("VIOLATION_DATE").show(3)

+--------------+
|VIOLATION_DATE|
+--------------+
|    12/30/2016|
|    12/30/2016|
|    12/30/2016|
+--------------+
only showing top 3 rows



For simplicity, we choose only a subset of the fields to use for modeling. 
Many of the other fields have missing values, and handling them is beyond the scope of this notebook. 

- Specify a subset of the columns
- Drop those rows which contain nulls

In [10]:
val keepCols = Array("VIOLATION_CODE", "VIOLATION_DESCRIPTION", 
                   "INSPECTION_STATUS", "INSPECTOR_ID", 
                   "INSPECTION_CATEGORY", "DEPARTMENT_BUREAU", 
                   "LATITUDE", "LONGITUDE")
val dfTrain = trainingData2016.select(keepCols.head, keepCols.tail: _*).na.drop
val dfTest = testingData2017.select(keepCols.head, keepCols.tail: _*).na.drop

keepCols = Array(VIOLATION_CODE, VIOLATION_DESCRIPTION, INSPECTION_STATUS, INSPECTOR_ID, INSPECTION_CATEGORY, DEPARTMENT_BUREAU, LATITUDE, LONGITUDE)
dfTrain = [VIOLATION_CODE: string, VIOLATION_DESCRIPTION: string ... 6 more fields]
dfTest = [VIOLATION_CODE: string, VIOLATION_DESCRIPTION: string ... 6 more fields]


[VIOLATION_CODE: string, VIOLATION_DESCRIPTION: string ... 6 more fields]

In [11]:
dfTrain.printSchema()

root
 |-- VIOLATION_CODE: string (nullable = true)
 |-- VIOLATION_DESCRIPTION: string (nullable = true)
 |-- INSPECTION_STATUS: string (nullable = true)
 |-- INSPECTOR_ID: string (nullable = true)
 |-- INSPECTION_CATEGORY: string (nullable = true)
 |-- DEPARTMENT_BUREAU: string (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)



<a id="build"></a>
## 2. Build a pipeline
When you deploy a model to Watson Machine Learning, you need to provide a `Spark Machine Learning Pipeline`, which indicates how to transform raw data into the representation required by our model. 
Pipelines typically include a series of transformers and terminate with a model or, especially in classification tasks, some transformers which will convert model predictions into string labels.

1. [Import transformers](#transformer)
2. [Instantiate model object and pipeline](#instantiate)

### 2.1 Import transformers<a id="transformer"></a>

In [12]:
/* Import transformers to build a pipeline  */ 
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorAssembler}
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics


We’ll use the `StringIndexer` to convert strings into a numeric representation for the machine. 

You can read about many transformations in the <a href="https://spark.apache.org/docs/2.0.2/ml-features.html"  target="_blank" rel="noopener noreferrer">Spark documentation</a>. 

We assign each transformer to a `val` because we’ll need to reference them later in the pipeline.

In the code cell below, each `StringIndexer` instance gets an input column with `setInputCol` and an output column with `setOutputCol`. The output columns feed the `VectorAssembler`, which gathers every feature we use for modeling into a single vector.
But what about string data that is not categorical? Sure, we can index all of the `INSPECTOR_ID` data, but does that make sense for `VIOLATION_DESCRIPTION`, where almost every value is unique? 
<br>

For text data like this, Spark provides other handy transformers, such as `RegexTokenizer` and `HashingTF`. The idea is simple. The tokenizer breaks the text into individual words, called `tokens`. `HashingTF` then maps the tokens in each violation description to their frequencies in a fixed-length vector. Because tokens are hashed rather than looked up in a learned vocabulary, the pipeline can also accept descriptions that contain words it has not seen before.

In [13]:
// Label column
val labelCol = new StringIndexer().setInputCol("INSPECTION_STATUS").setOutputCol("STATUS_LABEL").fit(df)

// Feature cols with String Indexer => Vector Assembler //

//* VIOLATION CODE * //
val interCodeCol = new StringIndexer().setInputCol("VIOLATION_CODE").setOutputCol("CODE_X").setHandleInvalid("skip")


//* INSPECTOR ID * //
val interSpector = new StringIndexer().setInputCol("INSPECTOR_ID").setOutputCol("INSP_X").setHandleInvalid("skip")


//* INSPECTION CATEGORY * //
val interCatSpector = new StringIndexer().setInputCol("INSPECTION_CATEGORY").setOutputCol("INCAT_X").setHandleInvalid("skip")


//* DEPARTMENT BUREAU * //
val interBureau = new StringIndexer().setInputCol("DEPARTMENT_BUREAU").setHandleInvalid("skip").setOutputCol("BUR_X")


//** DEALING WITH TEXT **//
val regexTokenizer = new RegexTokenizer().setInputCol("VIOLATION_DESCRIPTION").setOutputCol("WORD_X").setPattern("\\W")
val hashingTF = new HashingTF().setInputCol("WORD_X").setOutputCol("DESCRIPTION").setNumFeatures(150) // experiment with numFeatures + regularization params

// LAT AND LONG ARE NUMERIC //


//** VECTOR ASSEMBLER **//

val vecAssembler = new VectorAssembler().setInputCols(Array("BUR_X", "INCAT_X", "CODE_X", "INSP_X", "DESCRIPTION", "LATITUDE", "LONGITUDE")).setOutputCol("FEATURES")

labelCol = strIdx_8bf88cee6cfd
interCodeCol = strIdx_b11fc7033811
interSpector = strIdx_6f8f64a112ab
interCatSpector = strIdx_6a5e0b76aa91
interBureau = strIdx_1efc1c3bf7c4
regexTokenizer = regexTok_a69f069bd9ac
hashingTF = hashingTF_05c11333fe97
vecAssembler = vecAssembler_db4f136b8425


vecAssembler_db4f136b8425
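The tokenizing and hashing steps used for `VIOLATION_DESCRIPTION` can be sketched in plain Scala. This is only an illustration of the idea: the object `HashingSketch` is hypothetical, the sample text is made up, and `String.hashCode` stands in for whatever hash function Spark uses internally, so bucket numbers will not match Spark's.

```scala
object HashingSketch {
  // Lowercase the text, split on non-word characters, and drop empty pieces
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W").filter(_.nonEmpty).toSeq

  // Map each token to one of numFeatures buckets and count hits per bucket.
  // Assumption: String.hashCode is a stand-in for Spark's hash function.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens
      .map(t => ((t.hashCode % numFeatures) + numFeatures) % numFeatures)
      .groupBy(identity)
      .map { case (bucket, hits) => bucket -> hits.size.toDouble }
}
```

Any word, seen before or not, lands in one of the `numFeatures` buckets, which is why no vocabulary has to be learned. The trade-off is that unrelated words can collide in the same bucket, and a small setting such as 150 makes collisions more likely.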

### 2.2 Instantiate the model object and pipeline<a id="instantiate"></a>
Time to instantiate a new untrained model object and pipeline.

In [14]:
import org.apache.spark.ml.{Model, Pipeline, PipelineStage, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression

//** Logistic Regression **//
val logitModel = new LogisticRegression().setLabelCol("STATUS_LABEL").setFeaturesCol("FEATURES").setRegParam(0.1)


//** Convert index prediction back to string **//
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("PREDICTED_LABEL").setLabels(labelCol.labels)

logitModel = logreg_ab0b684fb5e6
labelConverter = idxToStr_b4faaf46f318


idxToStr_b4faaf46f318
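The relationship between the label indexer and `IndexToString` is a simple round trip: strings become numeric indices for training, and indices become strings again for reporting. Here is a minimal sketch of that round trip in plain Scala. The object `LabelIndexSketch` and the sample values are assumptions, and the alphabetical tie-break is a choice made here for determinism rather than a statement about Spark.

```scala
object LabelIndexSketch {
  // Order distinct labels by descending frequency, so the most common label gets index 0.
  // Assumption: ties are broken alphabetically to keep the result deterministic.
  def fitLabels(values: Seq[String]): Array[String] =
    values.groupBy(identity).toSeq
      .sortBy { case (label, hits) => (-hits.size, label) }
      .map(_._1)
      .toArray

  // Forward mapping: label -> index (what an indexer produces)
  def index(labels: Array[String], value: String): Double =
    labels.indexOf(value).toDouble

  // Inverse mapping: index -> label (what IndexToString does with setLabels)
  def unindex(labels: Array[String], idx: Double): String = labels(idx.toInt)
}
```

This is why `labelConverter` is given `labelCol.labels`: without the fitted label array there is no way to know which string a predicted index stands for.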

Build the modeling pipeline.

In [15]:
/* Logistic Regression Pipeline */ 
val logisticPipe = new Pipeline().setStages(
                                    Array(
                                        labelCol, 
                                        interCodeCol, 
                                        interSpector, 
                                        interCatSpector,
                                        interBureau,
                                        regexTokenizer, hashingTF,
                                        vecAssembler,
                                        logitModel                                                                  
                                    )
                                )

logisticPipe = pipeline_2c6504b03396


pipeline_2c6504b03396

<a id="train"></a>
## 3. Train the model
Call `.fit()` on the pipeline.

In [16]:
val trainedLogit = logisticPipe.fit(dfTrain)

trainedLogit = pipeline_2c6504b03396


pipeline_2c6504b03396
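For the logistic regression stage, fitting estimates a weight per feature and an intercept. A score in the interval $[0,1]$ then comes from passing the weighted sum through the logistic function. The sketch below illustrates that calculation in plain Scala. The object `LogisticScoreSketch` and the numbers used with it are hypothetical and are not the weights learned by this notebook.

```scala
object LogisticScoreSketch {
  // Logistic (sigmoid) function: maps any real number into the interval (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Score for one feature vector: sigmoid of the weighted sum plus the intercept
  def score(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must align")
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(margin)
  }
}
```

In Spark, the per-row scores appear in the `probability` column of the transformed DataFrame, next to the hard `prediction` column.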

We can make predictions and get metrics.

In [17]:
// predict
val predictionsLogisitc = trainedLogit.transform(dfTest)


// Prepare for metrics
val predictionAndLabels = predictionsLogisitc.select("STATUS_LABEL", "prediction").rdd.map(row => 
            (row.getAs[Double]("prediction"), row.getAs[Double]("STATUS_LABEL")))

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

predictionsLogisitc = [VIOLATION_CODE: string, VIOLATION_DESCRIPTION: string ... 17 more fields]
predictionAndLabels = MapPartitionsRDD[746] at map at <console>:90
metrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@33039935


org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@33039935

We have a new object `metrics`, which contains a lot of information.

In [18]:
// AUC
metrics.areaUnderROC

0.6090319971611869

An area of 0.5 under the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic"  target="_blank" rel="noopener noreferrer">ROC Curve</a> indicates that the model performs no better than random guessing. Our value of about 0.61 beats that by enough to continue for the purposes of this tutorial.
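One way to build intuition for this number: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Scala. The object `AucSketch` and its inputs are hypothetical, and the pairwise loop is quadratic, so it only suits small examples.

```scala
object AucSketch {
  // AUC as the fraction of (positive, negative) pairs in which the positive
  // example has the higher score; tied scores count as half a win.
  def auc(scoresAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoresAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoresAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

The notebook feeds the hard 0/1 `prediction` column into `BinaryClassificationMetrics`. Feeding a continuous score instead, such as the positive-class probability, would typically give a more informative curve, although that is a suggestion to experiment with rather than something this notebook does.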

<a id="save"></a>
## 4. Save the model
You’ll need an instance of [Watson Machine Learning](https://developer.ibm.com/clouddataservices/docs/ibm-watson-machine-learning/) to save your model. 

You can create a new instance directly from within Watson Studio, but you’ll need to log in to IBM Cloud for your credentials. IBM Cloud offers many powerful services and several are available at little to no cost, including Watson Machine Learning (WML). With WML, you have a repository to store and deploy your models. Consult <a href="https://developer.ibm.com/clouddataservices/docs/ibm-watson-machine-learning/get-started/"  target="_blank" rel="noopener noreferrer">IBM Developer resources</a> to help you get started if you haven't already. You can also check the companion <a href="https://medium.com/p/91b580450c5b/edit"  target="_blank" rel="noopener noreferrer">blog</a> for more details.

Import the <a href="https://watson-ml-libs.mybluemix.net/repository-scalaV3/#com.ibm.analytics.ngp.repository_v3.package"  target="_blank" rel="noopener noreferrer">IBM Scala Repository API Client for Watson Machine Learning</a> and other helpful libraries.

In [19]:
import com.ibm.analytics.ngp.repository_v3._

// Helper libraries

import scalaj.http.{Http, HttpOptions}
import scala.util.{Success, Failure}
import java.util.Base64
import java.nio.charset.StandardCharsets
import play.api.libs.json._

Now, fetch your credentials. The WML credentials enable you to communicate with your repository via the internet. 
Learn more about <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-setup.html?audience=wdp&context=wdp&linkInPage=true"  target="_blank" rel="noopener noreferrer">retrieving service credentials</a> in the Watson Studio documentation.

In [None]:
import scala.collection.mutable.HashMap
val wmlCredentials: HashMap[String, String] = HashMap(
    "url"->"***",
    "username"->"***",
    "password"->"***",
    "instance_id"->"***",
    "apikey"->"***"
)

In [None]:
val service_path = wmlCredentials("url")
val username = wmlCredentials("username")
val password = wmlCredentials("password")

Let’s make a connection and authorize.

In [None]:
// Authorize
val client = MLRepositoryClient(service_path)
client.authorize(username, password)

Next, use `MLRepositoryArtifact` to create a model artifact for the repository. We must pass:
- A Spark ML pipeline: `trainedLogit`
- The training data used: `dfTrain`
- A name for the model: `VIOLATIONS_SCALA211_SPARK20`

In [None]:
// model artifact
val model_artifact = MLRepositoryArtifact(trainedLogit, dfTrain, "VIOLATIONS_SCALA211_SPARK20")
val saved = client.models.save(model_artifact)

`saved` is the model artifact. Check it out:

In [25]:
println("modelType: " + saved.get.meta.prop("modelType"))
println("trainingDataSchema: " + saved.get.meta.prop("trainingDataSchema"))
println("creationTime: " + saved.get.meta.prop("creationTime"))
println("modelVersionHref: " + saved.get.meta.prop("modelVersionHref"))
println("label: " + saved.get.meta.prop("label"))
println("runtime: "+ saved.get.meta.prop("runtime"))

modelType: Some(standard)
trainingDataSchema: Some({"type":"struct","fields":[{"name":"VIOLATION_CODE","type":"string","nullable":true,"metadata":{}},{"name":"VIOLATION_DESCRIPTION","type":"string","nullable":true,"metadata":{}},{"name":"INSPECTION_STATUS","type":"string","nullable":true,"metadata":{"modeling_role":"target"}},{"name":"INSPECTOR_ID","type":"string","nullable":true,"metadata":{}},{"name":"INSPECTION_CATEGORY","type":"string","nullable":true,"metadata":{}},{"name":"DEPARTMENT_BUREAU","type":"string","nullable":true,"metadata":{}},{"name":"LATITUDE","type":"double","nullable":true,"metadata":{}},{"name":"LONGITUDE","type":"double","nullable":true,"metadata":{}}]})
creationTime: Some(2019-06-14T15:01:17.729Z)
modelVersionHref: None
label: Some(INSPECTION_STATUS)
runtime: Some(spark-2.3)


<a id="deploy"></a>
## 5. Deploy the model from within your Watson Studio project

Now that we’ve saved the model, we can deploy and create an API endpoint to score records. 


1. Go to **Assets** in your project, then find your model under **Models** and click it. We named ours `VIOLATIONS_SCALA211_SPARK20`.
2. Select **Deployment > Add Deployment**.
3. Enter a name and description for the deployment, then click **Save**.
4. Select the model, then click **Test** to test the API. 

See the Watson Studio <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-deploy_new.html" target="_blank" rel="noopener noreferrer">documentation</a> for more information about how to deploy a model within a project.
____________

## 6. Summary<a id="summary"></a>
That's awesome! We built a model with Scala and Spark on Watson Studio Cloud. 



### Citations

City of Chicago (2017). Building Violations. <a href="https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr" target="_blank" rel="noopener noreferrer">https://data.cityofchicago.org/Buildings/Building-Violations/22u3-xenr</a>. Chicago, IL: Chicago City Data Portal.

### Author
**Adam Massachi** is a Data Scientist with the Watson Studio and IBM Watson teams at IBM. Before IBM, he worked on political campaigns, building and managing large volunteer operations and organizing campaign finance initiatives. Say hello <a href="https://twitter.com/adammassach?lang=en"  target="_blank" rel="noopener noreferrer">@adammassach</a>!

Copyright © IBM Corp. 2018, 2019. This notebook and its source code are released under the terms of the MIT License.
