<img src='https://raw.githubusercontent.com/dxkikuchi/SparkSnips/master/dsxbanner.jpg' width='75%'></img>

IBM Data Science Experience is an interactive, collaborative, cloud-based environment where data scientists can use multiple tools to activate their insights.  Data scientists can use the best of open source, tap into IBM's unique features, grow their capabilities, and share their successes.  In addition to all the features in the current preview, many new capabilities are being added including the ability to ingest Object Storage data with a single click, an enhanced user interface for version control, a facility to comment or chat about a notebook with others, and many more!

## New York State Restaurant Inspections Notebook
This notebook will provide insights from official restaurant inspection records for most of New York state and provide visualizations of that data.  This data is available at <a href="https://health.data.ny.gov/Health/Food-Service-Establishment-Last-Inspection/cnih-y5dw" target="_blank">New York State Food Service Establishment: Last Inspection</a>.  A raw extract was taken in October of 2016 and is located at http://ibm.biz/nyrestaurantsdata.

In [6]:
nyrdata = 'http://ibm.biz/nyrestaurantsdata'

The CSV (comma-separated values) data will be read into a pandas dataframe (nyr), and the first 5 records are displayed using the 'head()' method.

In [7]:
import pandas as pd
nyr = pd.read_csv(nyrdata)
nyr.head()

Unnamed: 0,FACILITY,ADDRESS,LAST INSPECTED,VIOLATIONS,TOTAL # CRITICAL VIOLATIONS,TOTAL #CRIT. NOT CORRECTED,TOTAL # NONCRITICAL VIOLATIONS,DESCRIPTION,LOCAL HEALTH DEPARTMENT,COUNTY,...,PERMIT EXPIRATION DATE,PERMITTED (D/B/A),PERMITTED CORP. NAME,PERM. OPERATOR LAST NAME,PERM. OPERATOR FIRST NAME,NYS HEALTH OPERATION ID,INSPECTION TYPE,INSPECTION COMMENTS,FOOD SERVICE FACILITY STATE,Location1
0,QUEEN CITY ELKS LODGE #174,"726 BENJAMIN STREET, ELMIRA",09/12/2014,"Item 8A- Food not protected during storage,...",0.0,0.0,5.0,Food Service Establishment - Food Service Esta...,Chemung County,CHEMUNG,...,06/15/2017,,QUEEN CITY LODGE #174,DAVIS,WENDY,265599,Inspection,Probe thermometer observed next to fryers. Sa...,NY,"(42.099262, -76.805597)"
1,PEARL RIVER HOOK & LADDER CO.,"50 EAST CENTRAL AVENUE, PEARL RIVER",09/16/2014,Item 12C- Plumbing and sinks not properly si...,0.0,0.0,2.0,Food Service Establishment - Food Service Esta...,Rockland County,ROCKLAND,...,09/30/2018,,"PEARL RIVER HOOK AND LADDER CO. NO 1, INC.",HANSEN,RON,683793,Inspection,,NY,"(41.059177, -74.019413)"
2,MEX,"295 ALEXANDER STREET, ROCHESTER",09/15/2014,"Item 8A- Food not protected during storage,...",0.0,0.0,19.0,Food Service Establishment - Restaurant,Monroe County,MONROE,...,12/31/2016,,,,,667739,Inspection,,NY,"(43.154157, -77.59494)"
3,WHITNEY POINT SCHOOL CONCESS,"10 KEIBEL ROAD, WHITNEY POINT",10/02/2014,Item 8E- Accurate thermometers not availabl...,0.0,0.0,1.0,Food Service Establishment - Food Service Esta...,Broome County,BROOME,...,08/31/2017,,,WHITNEY POINT SCHOOL,,256949,Inspection,,NY,"(42.337851, -75.975476)"
4,COOKIE GIRL BAKE SHOP,"191 SOUTH MAIN STREET, NEW CITY",11/04/2014,No violations found.,0.0,0.0,0.0,Food Service Establishment - Food Service Esta...,Rockland County,ROCKLAND,...,11/30/2016,,"COOKIE GIRL, INC.",DAMESEK,STACEY,750841,Inspection,,NY,"(41.140763, -73.990604)"


Ingesting data can take as little as one line of code.  Data can similarly be ingested from Cloudant, dashDB, Object Storage, relational databases, and many other sources.
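As a rough illustration of ingesting from a relational source, the sketch below loads a query result into a pandas dataframe with `pd.read_sql_query`.  The in-memory SQLite database, the `inspections` table, and its column names are hypothetical stand-ins and are not part of this notebook's data pipeline; the two rows are copied from the extract shown above.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for a relational source: an in-memory SQLite database
# holding two rows copied from the inspection extract shown above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (facility TEXT, critical_violations REAL)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?)",
    [("MEX", 0.0), ("COOKIE GIRL BAKE SHOP", 0.0)],
)
conn.commit()

# One call pulls the query result into a pandas dataframe.
inspections = pd.read_sql_query("SELECT * FROM inspections", conn)
conn.close()
print(inspections.shape)
```

For a production database, the connection object would come from that database's own driver rather than from `sqlite3`.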

Another dataframe will be created that will only contain the columns that are pertinent.  The 'head()' method will display the first 5 records of this dataframe.

In [8]:
nyrcols = nyr[['FACILITY','TOTAL # CRITICAL VIOLATIONS','Location1']]
nyrcols.head()

Unnamed: 0,FACILITY,TOTAL # CRITICAL VIOLATIONS,Location1
0,QUEEN CITY ELKS LODGE #174,0.0,"(42.099262, -76.805597)"
1,PEARL RIVER HOOK & LADDER CO.,0.0,"(41.059177, -74.019413)"
2,MEX,0.0,"(43.154157, -77.59494)"
3,WHITNEY POINT SCHOOL CONCESS,0.0,"(42.337851, -75.975476)"
4,COOKIE GIRL BAKE SHOP,0.0,"(41.140763, -73.990604)"


The data will be transformed into a Spark dataframe 'nyrDF' and a table will be registered.  Spark dataframes are conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimizations under the hood.  A table that is registered can be used in subsequent SQL statements.

In [9]:
!pip install --user pyspark

Requirement not upgraded as not directly required: pyspark in /home/dsxuser/.local/lib/python3.5/site-packages
Requirement not upgraded as not directly required: py4j==0.10.6 in /home/dsxuser/.local/lib/python3.5/site-packages (from pyspark)


In [10]:
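from pyspark.sql import SparkSession

# Reuse the running Spark session, or start one if none exists.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# Convert the pandas dataframe to a Spark dataframe and register it for SQL as 'nyrDF'.
nyrDF = spark.createDataFrame(nyrcols)
nyrDF.createOrReplaceTempView("nyrDF")  # replaces the deprecated registerTempTable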

Now, a Spark dataframe 'nyvDF' will be created using SQL that will contain the restaurant name (FACILITY), latitude, longitude, and number of critical violations.  Note that the latitude and longitude are combined in the final column (Location1) of the retrieved data.  They will be extracted separately using regular expressions in the SQL.  The results are ordered by number of critical violations in descending order and limited to the top 1,000 records, and the first 10 are displayed.

In [11]:
sqlContext = spark  # the SparkSession runs SQL directly, so no second session lookup is needed
query = """
select 
    FACILITY, 
    trim(regexp_extract(location1, '(\\\()(.*),(.*)(\\\))',2)) as lat, 
    trim(regexp_extract(location1, '(\\\()(.*),(.*)(\\\))',3)) as lon,
    cast(`TOTAL # CRITICAL VIOLATIONS` as int) as Violations
from nyrDF 
order by Violations desc
limit 1000
"""

nyvDF = sqlContext.sql(query)
nyvDF.show(10)

+--------------------+---------+----------+----------+
|            FACILITY|      lat|       lon|Violations|
+--------------------+---------+----------+----------+
|LANSMANS CAFE    ...| 41.76024|   -74.598|        23|
|BRUEGGER'S BAGEL ...|43.086952| -77.63698|        17|
|HOLIDAY MOUNTAIN ...| 41.61992| -74.63633|        14|
|CARL R'S CAFE    ...|43.297719|-73.677516|        14|
|LOBSTER POT RESTA...|43.421017| -73.71429|        10|
|CAMP KINGSLEY - C...|43.399153|-75.532777|        10|
|CAPTAIN JACKS GOO...|43.269988|-76.980102|         9|
|BRICKHOUSE PIZZA ...|42.856533|-73.783211|         8|
|EL DORADO WEST   ...| 41.06001| -73.86167|         8|
|EMPIRE PIZZA     ...| 43.30909|-73.644628|         8|
+--------------------+---------+----------+----------+
only showing top 10 rows
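
To make the group numbering in the query easier to follow, here is a minimal sketch of the same pattern using Python's `re` module as a stand-in for Spark's `regexp_extract`.  The helper name `split_location` is hypothetical, and the sample value is the Location1 entry of the first record shown earlier.

```python
import re

# Same pattern as in the SQL query, written as a Python raw string.
# Group 1 = "(", group 2 = latitude, group 3 = longitude, group 4 = ")".
LOCATION_PATTERN = re.compile(r"(\()(.*),(.*)(\))")

def split_location(location):
    """Return (lat, lon) as trimmed strings, mirroring regexp_extract plus trim."""
    match = LOCATION_PATTERN.search(location)
    if match is None:
        # Assumption: mimic regexp_extract, which yields an empty string on no match.
        return "", ""
    return match.group(2).strip(), match.group(3).strip()

lat, lon = split_location("(42.099262, -76.805597)")
print(lat, lon)
```

Groups 2 and 3 are the ones requested in the query, which is why the SQL passes 2 for the latitude and 3 for the longitude.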



Brunel visualization will be used to map the latitude and longitude to a New York state map.  Colors represent the number of violations as noted in the key.

In [12]:
import brunel
nyvPan = nyvDF.toPandas()
%brunel map ('NY') + data('nyvPan') x(lon) y(lat) color(Violations) tooltip(FACILITY)


One of the key strengths of Data Science Experience is the ability to search for and quickly learn about various topics.  For example, to find articles, tutorials, or notebooks on Brunel, click the 'link' icon in the top right-hand corner of this web page ('Find Resources in the Community').  A side palette will appear where you can enter 'Brunel' or other topics of interest.  Related articles, tutorials, notebooks, and data cards will appear.

Pixiedust provides charting and visualization.  It is an open source Python library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.  Please execute the next cell for a tabular view of the data.

In [14]:
from pixiedust.display import *
display(nyvDF)

If you hover over the lone lighter-colored dot in the middle of the New York State map, you can see that it is 'CAMP KINGSLEY - CC'.  If you start typing 'camp' in the 'Search table' text field above, the record will be displayed.  Numerous visualizations are available, and support for maps is planned for the future.  Please take a look at the histogram of this data for another insight.  In addition, the data can be downloaded as a file or stashed to Cloudant or Object Storage.
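
For readers without Pixiedust at hand, the idea behind the histogram can be sketched with the standard library alone.  This is only an illustration: it uses the ten critical-violation counts from the 'nyvDF' output shown earlier, not the full 1,000-row result, and it is not how Pixiedust builds its chart.

```python
from collections import Counter

# Critical-violation counts from the ten rows shown in the nyvDF output.
violations = [23, 17, 14, 14, 10, 10, 9, 8, 8, 8]

# Count how many facilities share each violation total.
histogram = Counter(violations)

# Print a simple text histogram, highest violation count first.
for count in sorted(histogram, reverse=True):
    print("{:>2} | {}".format(count, "#" * histogram[count]))
```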

In just a few notebook cells, data was ingested, manipulated, and visualized to yield insights.  Much more capability, including machine learning, can be leveraged with IBM Data Science Experience.  This is just the tip of the iceberg!