
# Using PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation  

> Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations and visualizations. In this lab, you use Watson Studio to create a notebook to explore and visualize data to gain insight. We use PixieDust, an open source Python notebook helper library, to visualize the data in different ways (for example, charts, maps, etc.) with one simple call.  

![pixiedust](https://developer.ibm.com/clouddataservices/wp-content/uploads/sites/85/2017/03/pixiedust200.png)

You can access the complete tutorial with step-by-step instructions <a href="http://ibm.biz/pixiedustlab" target="_blank" rel="noopener noreferrer">here</a>.  

This notebook runs on Python 3.5 with Spark.

## Table of contents
1. [Import PixieDust](#install)
2. [Load the data](#loaddata)
3. [View and visualize the data](#view)
4. [Map the data](#mapdata)
5. [Summary](#summary)


## 1. Import PixieDust<a class="anchor" id="install"></a>
Before you can use the PixieDust library, you must import it into the notebook.

In [1]:
import pixiedust

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190517213121-0000
KERNEL_ID = 9db3e977-5039-4a25-847d-e19fe73a242d
Pixiedust database opened successfully
Table VERSION_TRACKER created successfully
Table METRICS_TRACKER created successfully

Share anonymous install statistics? (opt-out instructions)

PixieDust will record metadata on its environment the next time the package is installed or updated. The data is anonymized and aggregated to help plan for future releases, and records only the following values:

{
   "data_sent": currentDate,
   "runtime": "python",
   "application_version": currentPixiedustVersion,
   "space_id": nonIdentifyingUniqueId,
   "config": {
       "repository_id": "https://github.com/ibm-watson-data-lab/pixiedust",
       "target_runtimes": ["Data Science Experience"],
       "event_id": "web",
       "event_organizer": "dev-journeys"
   }
}
You can opt out by calling pixiedust.optOut() in a new cell.


Pixiedust runtime updated. Please restart kernel
Table SPARK_PACKAGES created successfully
Table USER_PREFERENCES created successfully
Table service_connections created successfully


## 2. Load the data<a class="anchor" id="loaddata"></a>
With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook. 
In this example, we load a data set with information about restaurant inspections.

In [2]:
inspections = pixiedust.sampleData("https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv")

Downloading 'https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv' from https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv
Downloaded 364220 bytes
Creating pySpark DataFrame for 'https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv'
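
To get a feel for what a CSV loader has to handle in this data set, the following sketch parses a few lines of it with Python's standard `csv` module. It is only an illustration and says nothing about how `sampleData` works internally; `SAMPLE_CSV` and `read_rows` are hypothetical names, and the sample is trimmed to four of the columns.

```python
import csv
import io

# Two rows copied from the inspection data, trimmed to four columns for brevity.
SAMPLE_CSV = '''address,category_name,city,location_name
4000 W Flamingo Rd,Bar / Tavern,Las Vegas,Gold Coast Hotel & Casino
9310 S EASTERN,Bar / Tavern,Henderson,"GREAT AMERICAN PUB, THE"
'''

def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = read_rows(SAMPLE_CSV)

# The quoted comma stays inside a single field instead of splitting the row.
print(rows[1]["location_name"])
```

Some fields in this data set contain commas inside quotes, which is why a real CSV parser is needed rather than a plain `split(',')`.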


## 3. View and visualize the data<a class="anchor" id="view"></a>
Use PixieDust's **`display`** API to easily view and visualize the data.

3.1 [Filter the data set](#filter)<br/>
3.2 [Visualize the number of restaurants by category](#category)<br/>
3.3 [Visualize the average number of inspection demerits per category clustered by the inspection grade](#inspectiongrade)<br/>
3.4 [Visualize current demerits vs inspection demerits](#demerits)

In [3]:
display(inspections)

:@computed_region_tnyv_z3b7,address,category_name,city,current_demerits,current_grade,date_current,employee_id,inspection_date,inspection_demerits,inspection_grade,inspection_result,inspection_time,inspection_type,location_1,location_1_address,location_1_city,location_1_state,location_1_zip,location_name,permit_number,permit_status,record_updated,restaurant_name,serial_number,state,violations,zip
,4000 W Flamingo Rd,Bar / Tavern,Las Vegas,3,A,2017-08-07 00:00:00,EE7001214,2017-08-07 00:00:00,3,A,'A' Grade,2017-08-07 13:20:00,Routine Inspection,POINT (115.1929046 36.1166353),,,,,Gold Coast Hotel & Casino,PR0006764,A,2017-08-07 16:51:28,Gold Coast Island Bar,DA16C7JRS,Nevada,215,89103-4004
,3330 W Tropicana Ave,Restaurant,Las Vegas,0,A,2017-08-08 00:00:00,EE7000594,2017-08-08 00:00:00,0,A,'A' Grade,2017-08-08 12:15:00,Routine Inspection,POINT (115.1849824 36.1016875),,,,,WILD WILD WEST CASINO,PR0009663,A,2017-08-08 13:29:05,Wild Wild West Employee Dining Room,DAIYKQCWF,Nevada,,89103
,9310 S EASTERN,Bar / Tavern,Henderson,6,A,2017-08-03 00:00:00,EE7001118,2017-08-03 00:00:00,6,A,'A' Grade,2017-08-03 14:15:00,Routine Inspection,POINT (115.117206 36.0195687),,,,,"GREAT AMERICAN PUB, THE",PR0120869,A,2017-08-03 14:49:49,"GREAT AMERICAN PUB BAR, THE",DANMKI4CM,Nevada,2152955,89123
,3730 S Las Vegas Blvd,Buffet,Las Vegas,3,A,2017-08-07 00:00:00,EE7001186,2017-08-07 00:00:00,3,A,'A' Grade,2017-08-07 12:45:00,Routine Inspection,POINT (115.1765836 36.1073485),,,,,ARIA HOTEL & CASINO,PR0024004,A,2017-08-07 14:15:33,ARIA EDR BUFFET,DAVWJD2RG,Nevada,29122955,89109
,3663 S Las Vegas Blvd 445,Bar / Tavern,Las Vegas,3,A,2018-01-24 00:00:00,EE7001217,2017-08-07 00:00:00,3,A,'A' Grade,2017-08-07 13:20:00,Routine Inspection,POINT (115.1727107 36.1094088),,,,,FAT TUESDAY @ MIRACLE MILE,PR0013481,A,2017-08-08 06:58:45,FAT TUESDAY,DAXR5BDB2,Nevada,21229112912292829292930,89109-1920
,3200 S Las Vegas Blvd 1144,Special Kitchen,Las Vegas,9,A,2019-01-02 00:00:00,EE7001343,2019-01-02 00:00:00,9,A,'A' Grade,2019-01-02 13:20:00,Routine Inspection,POINT (115.1730081 36.1282071),,,,,EL SEGUNDO SOL @ FASHION SHOW,PR0014425,,2019-01-02 15:27:49,EL SEGUNDO SOL MAIN PREP,DA0H4WYAO,Nevada,21129082927292929302955,89109-2612
,2000 N Fashion Show Dr,Restaurant,Las Vegas,0,A,2018-01-29 00:00:00,EE7001186,2017-08-09 00:00:00,5,A,'A' Grade,2017-08-09 08:00:00,Routine Inspection,POINT (115.1725941 36.1295451),,,,,TRUMP INTL HOTEL & TOWER,PR0014974,A,2017-08-09 08:53:47,TRUMP TOWER 1 EDR,DAOFSDWE5,Nevada,29282954,89109-1936
,4360 Blue Diamond Rd 108,Restaurant,Las Vegas,8,A,2017-12-11 00:00:00,EE7001290,2017-07-20 00:00:00,34,C,'C' Downgrade,2017-07-20 13:30:00,Routine Inspection,POINT (115.1997309 36.0363665),,,,,Ohana Hawaiian BBQ,PR0020435,C,2017-07-21 08:50:13,OHANA HAWAIIAN BBQ - RESTAURANT,DAEOWKL0A,Nevada,2022092112122132152162172909292829302955,89139-7783
,5601 N Tenaya Way 105,Restaurant,Las Vegas,0,A,2017-02-21 00:00:00,EE7000327,2017-02-21 00:00:00,0,A,'A' Grade,2017-02-21 10:55:00,Re-inspection,POINT (115.25117 36.263871),,,,,A MANO,PR0107003,A,2017-02-21 11:17:23,A MANO,DAKKEHD43,Nevada,2928,89130
,2211 S MARYLAND Pkwy 2,Restaurant,Las Vegas,0,A,2017-08-09 00:00:00,EE7001208,2017-03-24 00:00:00,5,A,'A' Grade,2017-03-24 11:50:00,Re-inspection,POINT (115.1366586 36.1457682),,,,,SAN SALVADOR RESTAURANT #2,PR0112871,A,2017-03-24 12:38:11,SAN SALVADOR #2 RESTAURANT,DANUWHJT3,Nevada,20929282930,89104


### 3.1 Filter the data set<a class="anchor" id="filter"></a>
Filter the data set to create a subset of only the Las Vegas restaurants.

In [4]:
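# Register the DataFrame as a temporary table so it can be queried with SQL
inspections.registerTempTable("restaurants")
# Keep only the rows whose city is exactly 'Las Vegas'
lasDF = sqlContext.sql("SELECT * FROM restaurants WHERE city='Las Vegas'")
lasDF.count()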

818
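
The same query can be tried outside Spark on a handful of rows. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL (an assumption made purely for illustration) and three rows taken from the output above, reduced to two columns. The variable names are hypothetical.

```python
import sqlite3

# Three rows from the display output, reduced to the columns the filter needs.
sample = [
    ("Gold Coast Island Bar", "Las Vegas"),
    ("GREAT AMERICAN PUB BAR, THE", "Henderson"),
    ("ARIA EDR BUFFET", "Las Vegas"),
]

# SQLite stands in for Spark SQL here; the WHERE clause is the one used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (restaurant_name TEXT, city TEXT)")
conn.executemany("INSERT INTO restaurants VALUES (?, ?)", sample)

las_rows = conn.execute(
    "SELECT * FROM restaurants WHERE city='Las Vegas'"
).fetchall()
las_count = len(las_rows)
conn.close()
```

The comparison is an exact string match, so the Henderson row is dropped and only rows spelled exactly `Las Vegas` are kept.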

<br/>  
### 3.2 Visualize the number of restaurants by category<a class="anchor" id="category"></a>  

Now display the number of restaurants for each category:

1. Click the **Chart** dropdown menu, then choose **Bar Chart**.
2. From the **Chart Options** dialog:
	1. Drag the **`category_name`** field and drop it into the **Keys** area.
	2. Drag the **`count`** field and drop it into the **Values** area.
	3. Set the **# of Rows to Display** to 1000.
	4. Click **OK**.
3. Click the **Renderer** dropdown menu, then choose **bokeh**.
4. Toggle the **Show Legend** Bar Chart Option to show or hide the legend.


In [None]:
!pip install -U --no-deps bokeh 

In [6]:
# Number of restaurants by categories

bycat = lasDF.groupBy("category_name").count()
display(bycat)
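
As a rough picture of what `groupBy("category_name").count()` produces, the sketch below counts categories in plain Python. It uses only the nine Las Vegas rows visible in the display output above, so the numbers are not those of the full data set; `categories` and `counts` are hypothetical names.

```python
from collections import Counter

# category_name of each Las Vegas row visible in the display output above.
categories = [
    "Bar / Tavern", "Restaurant", "Buffet", "Bar / Tavern", "Special Kitchen",
    "Restaurant", "Restaurant", "Restaurant", "Restaurant",
]

# One entry per category with the number of rows in it: the bar heights.
counts = Counter(categories)

for name, n in counts.items():
    print(name, n)
```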

<br/>  

### 3.3 Visualize the average number of inspection demerits per category clustered by the inspection grade<a class="anchor" id="inspectiongrade"></a>  

1. Click the **Chart** dropdown menu and choose **Bar Chart**.
2. From the **Chart Options** dialog:
	1. Drag the **`category_name`** field and drop it into the **Keys** area.
	2. Drag the **`inspection_demerits`** field and drop it into the **Values** area.
	3. Set the **Aggregation** to AVG.
	4. Set the **# of Rows to Display** to 1000. 
	5. Click **OK**.
3. Click the **Renderer** dropdown menu and choose **bokeh**.
4. Click the **Cluster By** dropdown menu and choose **inspection_grade**.
5. Click the **Type** dropdown menu and choose the desired bar type, for example, **stacked**.
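
To make the chart options above concrete, the following sketch computes the average `inspection_demerits` for each combination of category and inspection grade in plain Python. It is an approximation of what the AVG aggregation with **Cluster By** shows, not PixieDust's own code, and it uses only the nine Las Vegas rows visible in the earlier display output. The names `sample_rows` and `averages` are hypothetical.

```python
from collections import defaultdict

# (category_name, inspection_grade, inspection_demerits) for the Las Vegas
# rows visible in the earlier display output.
sample_rows = [
    ("Bar / Tavern", "A", 3),
    ("Restaurant", "A", 0),
    ("Buffet", "A", 3),
    ("Bar / Tavern", "A", 3),
    ("Special Kitchen", "A", 9),
    ("Restaurant", "A", 5),
    ("Restaurant", "C", 34),
    ("Restaurant", "A", 0),
    ("Restaurant", "A", 5),
]

# Collect demerits per (category, grade) group, then average each group.
groups = defaultdict(list)
for category, grade, demerits in sample_rows:
    groups[(category, grade)].append(demerits)

averages = {key: sum(values) / len(values) for key, values in groups.items()}

for (category, grade), avg in sorted(averages.items()):
    print(category, grade, avg)
```

Each key corresponds to one bar segment: the category selects the position on the axis and the grade selects the cluster within it.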

### 3.4 Visualize current demerits vs inspection demerits <a class="anchor" id="demerits"></a>

1. From the **Chart Options** dialog:
	1. Set the **Keys** to **`inspection_demerits`**.
	2. Set the **Values** to **`current_demerits`**.
	3. Set the **# of Rows to Display** to 1000.
	4. Click **OK**.
2. Click the **Chart** dropdown menu and choose **Scatter Plot**.
3. Select **bokeh** from the **Renderer** dropdown menu.
4. Select **inspection_grade** from the **Color** dropdown menu.

In [None]:
display(lasDF)
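
One way to read the scatter plot is to look for points that leave the diagonal, where the current demerits differ from the demerits recorded at the listed inspection. The sketch below does this in plain Python for the nine Las Vegas rows visible in the earlier display output; `pairs` and `changed` are hypothetical names, and the full data set may show other patterns.

```python
# (inspection_demerits, current_demerits) for the Las Vegas rows visible
# in the earlier display output.
pairs = [(3, 3), (0, 0), (3, 3), (3, 3), (9, 9), (5, 0), (34, 8), (0, 0), (5, 0)]

# Rows where the two values differ are the points that sit off the diagonal.
changed = [(insp, cur) for insp, cur in pairs if insp != cur]

print(len(changed), "of", len(pairs), "sample rows are off the diagonal")
```

In this small sample, every off-diagonal row has fewer current demerits than inspection demerits.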

## 4. Map the data<a class="anchor" id="mapdata"></a>  

Now visualize the restaurant inspection data together with the restaurant location on a map.

Currently, PixieDust has two map renderers, Google and MapBox, and each requires a token to display correctly. For this section of the tutorial, you use the **MapBox** renderer, so you need to create a <a href="https://www.mapbox.com/help/create-api-access-token/" target="_blank" rel="noopener noreferrer">MapBox API Access Token</a> if you choose to continue.

4.1 [Create longitude and latitude fields](#longlat)<br/>
4.2 [Display the map](#viewmapdata)

### 4.1 Create longitude and latitude fields<a class="anchor" id="longlat"></a> 

The current data includes the longitude and latitude in the **`location_1`** field as a string like: `POINT (-114.923505 36.114434)`

However, the current map renderers in PixieDust expect the longitude and latitude as separate number fields, so the first thing you need to do is parse the **`location_1`** field into separate longitude and latitude number fields.
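
The parsing step can be checked on plain strings without Spark. The sketch below is one possible approach using a regular expression; `POINT_RE`, `parse_point`, and `to_western` are hypothetical names. It covers only the string parsing and the sign correction, on the assumption that every restaurant in this data set lies in the western hemisphere.

```python
import re

# Matches "POINT (<lon> <lat>)"; both numbers may carry a sign and decimals.
POINT_RE = re.compile(r"POINT \((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)")

def parse_point(value):
    """Return (lon, lat) as floats, or None if the value cannot be parsed."""
    if value is None:
        return None
    match = POINT_RE.match(value.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def to_western(lon):
    """Force a negative sign, since some rows store the longitude unsigned."""
    return -abs(lon)

lon, lat = parse_point("POINT (115.1929046 36.1166353)")
print(to_western(lon), lat)
```

Some rows in the display output store the longitude without its minus sign, for example `POINT (115.1929046 36.1166353)`, which is why the sign is corrected after parsing.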

In [8]:
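# Parse the location_1 field into separate longitude and latitude number fields

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def valueToLon(value):
    # location_1 looks like "POINT (lon lat)"; skip rows where it is missing
    if value is None:
        return None
    lon = float(value.split('POINT (')[1].strip(')').split(' ')[0])
    # Treat 0 as missing; otherwise force a negative (western-hemisphere) longitude
    return None if lon == 0 else -abs(lon)

def valueToLat(value):
    if value is None:
        return None
    lat = float(value.split('POINT (')[1].strip(')').split(' ')[1])
    # Treat 0 as missing
    return None if lat == 0 else lat

# Wrap the parsers as Spark UDFs that return doubles
udfValueToLon = udf(valueToLon, DoubleType())
udfValueToLat = udf(valueToLat, DoubleType())

# Add the lon and lat columns to the Las Vegas DataFrame
lonDF = lasDF.withColumn("lon", udfValueToLon("location_1"))
lonlatDF = lonDF.withColumn("lat", udfValueToLat("location_1"))

lonlatDF.printSchema()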

root
 |-- :@computed_region_tnyv_z3b7: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- category_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- current_demerits: integer (nullable = true)
 |-- current_grade: string (nullable = true)
 |-- date_current: timestamp (nullable = true)
 |-- employee_id: string (nullable = true)
 |-- inspection_date: timestamp (nullable = true)
 |-- inspection_demerits: integer (nullable = true)
 |-- inspection_grade: string (nullable = true)
 |-- inspection_result: string (nullable = true)
 |-- inspection_time: timestamp (nullable = true)
 |-- inspection_type: string (nullable = true)
 |-- location_1: string (nullable = true)
 |-- location_1_address: string (nullable = true)
 |-- location_1_city: string (nullable = true)
 |-- location_1_state: string (nullable = true)
 |-- location_1_zip: string (nullable = true)
 |-- location_name: string (nullable = true)
 |-- permit_number: string (nullable = true)
 |-- permit_

<br/>  

### 4.2 Display the map<a class="anchor" id="viewmapdata"></a>  

Now that you have separate **`lon`** and **`lat`** fields, perform the following steps to display the data on a map:

1. Click the **Chart** dropdown menu, then choose **Map**.
2. From the **Chart Options** dialog:
	1. Drag the **`lon`** field and the **`lat`** field and drop them into the **Keys** area.
	2. Drag the **`current_demerits`** field and drop it into the **Values** area.
	3. Set the **# of Rows to Display** to 1000. 
	4. Enter your access token from MapBox into the **MapBox Access Token** field.
	5. Click **OK**.
3. Click the **kind** dropdown menu and choose **choropleth**.


In [9]:
display(lonlatDF)
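
Because the parsing step returns a null for a coordinate of 0, some rows may reach the map without a usable position. How the renderer treats such rows is not covered here, so the sketch below shows one way to check coordinates up front in plain Python. The helper `is_mappable` and the sample `points` are hypothetical; on the Spark side, `lonlatDF.dropna(subset=["lon", "lat"])` would drop rows with a null in either column.

```python
def is_mappable(lon, lat):
    """True if both coordinates are present and within valid geographic ranges."""
    if lon is None or lat is None:
        return False
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0

# One valid point followed by three made-up problem cases.
points = [
    (-115.1929046, 36.1166353),  # valid
    (None, 36.1),                # missing longitude
    (-115.17, None),             # missing latitude
    (-115.17, 136.1),            # latitude out of range
]

mappable = [p for p in points if is_mappable(*p)]
print(len(mappable), "of", len(points), "points can be placed on a map")
```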

## 5. Summary<a class="anchor" id="summary"></a>
Using a data set about restaurant inspections, this notebook has shown how a single PixieDust `display` call lets you visualize the data in a variety of ways to gain useful insights.

### Author

**Va Barbosa** is an IBM development advocate.

<hr>
Copyright © IBM Corp. 2017-2019. This notebook and its source code are released under the terms of the MIT License.
