# Static data analysis using Python, Apache Spark, and PixieDust
***

In this notebook, you'll first analyze customer demographics, such as age, gender, income, and location. Then you'll combine that data with sales data to examine trends for product categories, transaction types, and product popularity. You'll load data from a previous notebook as well as from a public open data set; cleanse, shape, and enrich the data; and then visualize it with the PixieDust library. Don't worry! PixieDust graphs don't require coding. By the end of the notebook, you'll understand how to combine data to gain insights about which customers you might target to increase sales.

<img src="https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/images/part_1.png"></img>

This notebook runs on Python 2 with Spark 2.1 and PixieDust 1.1.10.

<a id="toc"></a>
## Table of contents

#### [Setup](#Setup)
[Load data into the notebook](#Load-data-into-the-notebook)
#### [Part 1. Explore customer demographics](#part1)
[Prepare the customer data set](#Prepare-the-customer-data-set)<br>
[Visualize customer demographics and locations](#Visualize-customer-demographics-and-locations)<br>
[Enrich demographic information with open data](#Enrich-demographic-information-with-open-data)<br>   

#### [Summary and next steps](#summary)

## Setup
You need to import libraries and load the customer data into this notebook.

- Install the most current packages so we can take advantage of the latest features.

In [1]:
# run this cell
# jinja2 version 2.10 is required
#! pip install jinja2 --user --upgrade
# pixiedust version 1.1.7.1 (or above) is required
#! pip install pixiedust --user --upgrade
#!pip install -U --no-deps bokeh

> **If any package was updated, restart the kernel and reload the browser page.**

Import the necessary libraries:

In [2]:
import pixiedust
import pyspark.sql.functions as func
import pyspark.sql.types as types
import re
import json
import os
import requests  

Pixiedust database opened successfully


### Load data into the notebook

The data file contains both the customer demographic data that you'll analyze in Part 1 and the sales transaction data for Part 2.

In [3]:
raw_df = pixiedust.sampleData('https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv')

Downloading 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv' from https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv
Downloaded 5648773 bytes
Creating pySpark DataFrame for 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv'


[Back to Table of Contents](#toc)
<a id="part1"></a>
# Part 1. Explore customer demographics 
In this part of the notebook, you'll prepare the customer data and then start learning about your customers by creating multiple charts and maps. 

## Prepare the customer data set
You'll create a new DataFrame with just the data you need and then cleanse and enrich the data.

Extract the columns that you want and remove duplicate customers:

In [4]:
# Extract the customer information from the data set
# CUSTNAME: string, GenderCode: string, ADDRESS1: string, CITY: string, STATE: string, COUNTRY_CODE: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, ADDRESS2: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, CREDITCARD_TYPE: string, LOCALITY: string, SALESMAN_ID: string, NATIONALITY: string, NATIONAL_ID: string, CREDITCARD_NUMBER: bigint, DRIVER_LICENSE: string, CUST_ID: int,
customer_df = raw_df.select("CUST_ID", 
                            "CUSTNAME", 
                            "ADDRESS1", 
                            "ADDRESS2", 
                            "CITY", 
                            "POSTAL_CODE", 
                            "POSTAL_CODE_PLUS4", 
                            "STATE", 
                            "COUNTRY_CODE", 
                            "EMAIL_ADDRESS", 
                            "PHONE_NUMBER",
                            "AGE",
                            "GenderCode",
                            "GENERATION",
                            "NATIONALITY", 
                            "NATIONAL_ID", 
                            "DRIVER_LICENSE").dropDuplicates()

customer_df

DataFrame[CUST_ID: int, CUSTNAME: string, ADDRESS1: string, ADDRESS2: string, CITY: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, STATE: string, COUNTRY_CODE: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, AGE: string, GenderCode: string, GENERATION: string, NATIONALITY: string, NATIONAL_ID: string, DRIVER_LICENSE: string]

Notice that the data type of the AGE column is currently a string. Convert the AGE column to a numeric data type so you can run calculations on customer age.

In [5]:
# ---------------------------------------
# Cleanse age (enforce numeric data type) 
# ---------------------------------------

def getNumericVal(col):
    """
    input: the raw AGE value of a row (a string, or None if missing)
    output: the numeric value represented by col or None
    """
    if col is None:
        # int(None) would raise a TypeError, so handle missing values first
        return None
    try:
        return int(col)
    except ValueError:
        # Handle values such as 'age-33'
        match = re.match(r'^age-(\d+)$', col)
        if match:
            return int(match.group(1))
        return None

toNumericValUDF = func.udf(getNumericVal, types.IntegerType())
customer_df = customer_df.withColumn("AGE", toNumericValUDF(customer_df["AGE"]))
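
Because the UDF wraps an ordinary Python function, the cleansing rules can be checked without Spark. The following is a minimal sketch that restates the same rules in a stand-alone function. The name `parse_age` and the sample strings are illustrative assumptions, not values taken from the data set.

```python
import re

def parse_age(raw):
    """Return the age encoded in raw as an int, or None if it cannot be parsed."""
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        # Accept the 'age-33' style mentioned in the cell above
        match = re.match(r'^age-(\d+)$', raw)
        return int(match.group(1)) if match else None

# Hypothetical raw values: a plain number, the prefixed form, and unusable input
samples = ['26', 'age-33', '', 'unknown', None]
parsed = [parse_age(s) for s in samples]
print(parsed)
```

Values that cannot be parsed become `None`, which Spark stores as a null in the integer `AGE` column.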

The GenderCode column contains salutations instead of gender values. Derive the gender information for each customer based on the salutation and store it in a new column named GENDER.

In [6]:
# ------------------------------
# Derive gender from salutation
# ------------------------------
def deriveGender(col):
    """ input: the salutation of a row (a string, or None if missing)
        output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'
    
deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())
customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()

DataFrame[CUST_ID: int, CUSTNAME: string, ADDRESS1: string, ADDRESS2: string, CITY: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, STATE: string, COUNTRY_CODE: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, AGE: int, GenderCode: string, GENERATION: string, NATIONALITY: string, NATIONAL_ID: string, DRIVER_LICENSE: string, GENDER: string]
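
The salutation rules can also be written as a lookup table, which may be easier to extend if more salutations turn up in the data. This is a sketch in plain Python, not the notebook's implementation. The names `SALUTATION_TO_GENDER` and `derive_gender` are illustrative, and `'Dr.'` is a hypothetical value used only to show the fallback.

```python
# Same mapping as the cell above, expressed as a dictionary
SALUTATION_TO_GENDER = {
    'Mr.': 'male',
    'Master.': 'male',
    'Mrs.': 'female',
    'Miss.': 'female',
}

def derive_gender(salutation):
    """Return 'male', 'female', or 'unknown' for a salutation string (or None)."""
    return SALUTATION_TO_GENDER.get(salutation, 'unknown')

for s in ['Mr.', 'Miss.', 'Dr.', None]:
    print('{0!r} -> {1}'.format(s, derive_gender(s)))
```

Any salutation that is missing from the table, including a null, falls through to `'unknown'`.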

## Explore the customer data set

You can quickly explore data sets using PixieDust's data set explorer. Invoke the `display()` command and click the table icon to review the schema and preview the data. Customize the options to display only a subset of the fields or rows or apply a filter (by clicking the funnel icon).

In [7]:
display(customer_df)

[Back to Table of Contents](#toc)
## Visualize customer demographics and locations

Now you're ready to explore the customer base. Using simple charts, you can quickly see these characteristics:
 * Customer demographics (gender and age)
 * Customer locations (city, state, and country)

You'll create charts with the PixieDust library:

 - [View customers by gender in a pie chart](#View-customers-by-gender-in-a-pie-chart)
 - [View customers by generation in a bar chart](#View-customers-by-generation-in-a-bar-chart)
 - [View customers by age in a histogram chart](#View-customers-by-age-in-a-histogram-chart)
 - [View specific information with a filter function](#View-specific-information-with-a-filter-function)
 - [View customer density by location with a map](#View-customer-density-by-location-with-a-map)

### View customers by gender in a pie chart

Run the `display()` command and then configure the graph to show the percentages of male and female customers:

1. Run the next cell. The PixieDust interactive widget appears.  
1. Click the chart button and choose **Pie Chart**. The chart options tool appears.
1. In the chart options, drag `GENDER` into the **Keys** box. 
1. In the **Aggregation** field, choose **COUNT**. 
1. Click **OK**. The pie chart appears.

If you want to make further changes, click **Options** to return to the chart options tool.

In [8]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
12123,Joseph Dyke,364 Buena Vista Avenue,,Morrisville,27560,0,NC,US,Joseph.E.Dyke@mailinator.com,615-780-1112,26.0,Mr.,Gen_Y,IT,RNSVAE87H57E507C,,male
10013,Norbert Cantu,2124 Henry Ford Avenue,,Ada,49301,0,MI,US,Norbert.T.Cantu@spambob.com,614-355-2446,55.0,Mr.,Baby_Boomers,CA,272701863,,male
10985,Grace Alfaro,2284 Whitetail Lane,,Eagleville,19403,0,PA,US,Grace.J.Alfaro@mailinator.com,562-802-9460,,Mrs.,Gen_Z,ES,8857379X,,female
11249,Carolyn Delacruz,1388 Ingram Street,,Glencoe West,5291,0,SA,AU,Carolyn.J.Delacruz@dodgeit.com,(03) 5329 9256,46.0,Mrs.,Gen_X,IT,NWDJMZ72L16A954T,,female
12122,Caitlin Manley,2910 Gateway Avenue,,Morrisville,27560,0,NC,US,Caitlin.J.Manley@trashymail.com,586-634-5035,50.0,Mrs.,Gen_X,ES,X2010794L,,female
10645,Adam Blakeney,1139 Nickel Road,,Catanzaro Sala,88100,0,CZ,IT,Adam.T.Blakeney@trashymail.com,0349 5961114,60.0,Mr.,Baby_Boomers,U.S.,22868217,,male
11602,Jack Cantrell,4843 Deans Lane,,Koongawa,5650,0,SA,AU,Jack.E.Cantrell@mailinator.com,(07) 4931 4430,,Mr.,Gen_Z,ES,3754080C,,male
14971,Jayme Bellantoni,1957 Seth Street,,Los Angeles,90003,0,CA,US,Jayme.Bellantoni@csc.jp,665-332-9916,22.0,Mrs.,Gen_Z,U.S.,22747484,,female
12212,Steven Parker,3881 Colonial Drive,,New Brighton,55112,0,MN,US,Steven.J.Parker@mailinator.com,505-299-1706,57.0,Mr.,Baby_Boomers,ES,7589661Y,,male
11935,Essie Jones,1557 Rhapsody Street,,Maurine,57627,0,SD,US,Essie.F.Jones@pookmail.com,718-351-2568,45.0,Mrs.,Gen_X,CA,397614645,,female


[Back to Table of Contents](#toc)
### View customers by generation in a bar chart
Look at how many customers you have per "generation."

Run the next cell and configure the graph: 
1. Choose **Bar Chart** as the chart type.
2. Put `GENERATION` into the **Keys** box.
3. Set **Aggregation** to `COUNT`.
4. Click **OK**.
5. Change the **Renderer** at the top right of the chart to explore different visualisations.

In [9]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
11139,Jeffrey Steward,1139 Walton Street,,Fort Lauderdale,33301,0,FL,US,Jeffrey.P.Steward@dodgeit.com,217-381-7127,24.0,Mr.,Gen_Y,ES,2914553Q,,male
11352,Lonnie Robinson,1451 Tennessee Avenue,,Harold,41635,0,KY,US,Lonnie.M.Robinson@spambob.com,404-899-3014,71.0,Mr.,Baby_Boomers,ES,1505024L,,male
11615,Miranda Varnum,928 Dola Mine Road,,La Plaine-saint-denis,93210,0,,FR,Miranda.M.Varnum@pookmail.com,02.38.80.64.11,,Mrs.,Gen_Z,IT,KLXOUO65H43F146Y,,female
11927,Kerri Cooper,4281 Barnes Avenue,,Mataro,8300,0,,ES,Kerri.B.Cooper@trashymail.com,91-271-2781,69.0,Mrs.,Baby_Boomers,IT,TCOVGC04L62H642M,,female
10512,Robert Moon,1614 Flanigan Oaks Drive,,Callicoon,12723,0,NY,US,Robert.C.Moon@pookmail.com,985-407-2085,65.0,Mr.,Baby_Boomers,ES,8059361T,,male
15477,Joyce Morales,4704 Round Table Drive,,Los Angeles,90046,0,CA,US,Joyce.Morales@pookmail.com,620-689-7822,49.0,Mrs.,Gen_X,U.S.,22747484,,female
13728,Rod Falcon,1951 Lake Floyd Circle,,Los Angeles,90040,0,CA,US,Rod.Falcon@fact-mail.com,641-132-8160,68.0,Mr.,Baby_Boomers,U.S.,22747484,,male
14262,Burl Dugger,54 New Street,,Los Angeles,90023,0,CA,US,Burl.Dugger@kyouin.com,722-335-3794,72.0,Mr.,Baby_Boomers,U.S.,22747484,,male
12256,Jonathan Etter,2795 Stratford Park,,Norfolk,23502,0,VA,US,Jonathan.M.Etter@dodgeit.com,903-560-3160,72.0,Mr.,Baby_Boomers,UK,EY082703C,,male
13906,Carmen Gulledge,3286 Walt Nuzum Farm Road,,Los Angeles,90035,0,CA,US,Carmen.Gulledge@lycos.com,725-432-5327,49.0,Mrs.,Gen_X,U.S.,22747484,,female


You can use clustering to group customers, for example by geographic location. To group generations by country, select `COUNTRY_CODE` from the **Cluster by** list. 
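
Conceptually, clustering a `COUNT` chart amounts to counting rows per combination of key and cluster value. The sketch below illustrates that idea in plain Python using the `GENERATION` and `COUNTRY_CODE` values of the ten preview rows shown above. It is an illustration of the concept, not a description of how PixieDust computes the chart.

```python
from collections import Counter

# (GENERATION, COUNTRY_CODE) pairs copied from the ten preview rows above
pairs = [
    ('Gen_Y', 'US'), ('Baby_Boomers', 'US'), ('Gen_Z', 'FR'),
    ('Baby_Boomers', 'ES'), ('Baby_Boomers', 'US'), ('Gen_X', 'US'),
    ('Baby_Boomers', 'US'), ('Baby_Boomers', 'US'), ('Baby_Boomers', 'US'),
    ('Gen_X', 'US'),
]

# One count per (key, cluster) combination, i.e. one bar per combination
counts = Counter(pairs)
for (generation, country), n in sorted(counts.items()):
    print('{0:<13} {1}  {2}'.format(generation, country, n))
```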

[Back to Table of Contents](#toc)
### View customers by age in a histogram chart
A generation is a broad age range. You can look at a smaller age range with a histogram chart. A histogram is like a bar chart except each bar represents a range of numbers, called a bin. You can customize the size of the age range by adjusting the bin size. The more bins you specify, the smaller the age range.

Run the next cell and configure the graph:
1. Choose **Histogram** as the chart type. 
2. Put `AGE` into the **Values** box and click **OK**.
3. Use the **Bin count** slider to specify the number of bins. Try starting with 40.

In [10]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
11466,Serafina Wheeler,1745 Fort Street,,Inglewood,3517,0,VIC,AU,Serafina.J.Wheeler@trashymail.com,(02) 6765 1303,21.0,Mrs.,Gen_Z,IT,JFDXHJ28L17G446V,,female
12617,Susan Leventhal,109 Pickens Way,,Richmond,23222,0,VA,US,Susan.M.Leventhal@spambob.com,740-910-2749,,Mrs.,Gen_Z,IT,WIUTYN42E70F219G,,female
11243,Juan Gilmore,3959 Whiteman Street,,Giudecca,30133,0,VE,IT,Juan.D.Gilmore@trashymail.com,0334 3577209,70.0,Mr.,Baby_Boomers,UK,WL555884B,,male
11885,Larry Gravely,552 Formula Lane,,Marion,54950,0,WI,US,Larry.W.Gravely@mailinator.com,618-344-3812,26.0,Mr.,Gen_Z,U.S.,538590001,,male
13176,Everette Roland,437 Big Elm,,Torrimpietra,50,0,RM,IT,Everette.C.Roland@dodgeit.com,0399 0917557,20.0,Mr.,Gen_Z,CA,478969892,,male
11525,Laura Samuels,2855 Center Street,,Johnstones Hill,3870,0,VIC,AU,Laura.L.Samuels@spambob.com,(02) 6778 1332,38.0,Mrs.,Gen_Y,U.S.,230990001,,female
14895,Curtis Patterson,844 Chipmunk Lane,,Los Angeles,90002,0,CA,US,Curtis.Patterson@lopox.com,144-569-3167,38.0,Mr.,Gen_Y,U.S.,22747484,,male
15252,Bobbie Barnes,2379 Diane Street,,Los Angeles,90088,0,CA,US,Bobbie.Barnes@kobej.zzn.com,633-177-5628,69.0,Miss.,Gen_Z,U.S.,22747484,,female
10666,Nancy Mills,2672 Beechwood Drive,,Cement Mills,4352,0,QLD,AU,Nancy.B.Mills@dodgeit.com,(03) 9034 8246,47.0,Mrs.,Gen_X,FR,1.73E+14,,female
11999,Jean Blas,8 Angie Drive,,Miano,80131,0,,IT,Jean.A.Blas@trashymail.com,0344 7698695,47.0,Master.,Gen_X,CA,110581147,,male
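
To see how the bin count affects the width of each age range, here is a small sketch of equal-width binning in plain Python. It uses the nine non-null `AGE` values from the preview rows above. The function `bin_counts` is a hypothetical helper, and the exact bin edges that PixieDust's renderers choose may differ.

```python
def bin_counts(values, bin_count):
    """Count values in bin_count equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bin_count)
    counts = [0] * bin_count
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum value is placed in the last bin
            idx = min(int((v - lo) / width), bin_count - 1)
        counts[idx] += 1
    return width, counts

# Non-null AGE values from the preview rows above
ages = [21, 70, 26, 20, 38, 38, 69, 47, 47]

for n in (5, 10):
    width, counts = bin_counts(ages, n)
    print('{0} bins, each {1} years wide: {2}'.format(n, width, counts))
```

Doubling the bin count halves the width of each bin, which is why a higher bin count shows a finer-grained age distribution.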


[Back to Table of Contents](#toc)
### View specific information with a filter function

You can filter records to restrict analysis by using the [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) `filter()` function.

If you want to view the age distribution for a specific generation, uncomment the desired filter condition and run the next cell:

In [11]:
# Data subsetting: display age distribution for a specific generation
# (Chart type: histogram, Chart Options > Values: AGE)
# To change the filter condition, comment out the active line and remove the # sign from another
condition = "GENERATION = 'Baby_Boomers'"
#condition = "GENERATION = 'Gen_X'"
#condition = "GENERATION = 'Gen_Y'"
#condition = "GENERATION = 'Gen_Z'"
display(customer_df.filter(condition))


PixieDust supports basic filtering to make it easy to analyse data subsets. For example, to view the age distribution for a specific gender, configure the chart as follows:

  1. Choose **Histogram** as the chart type.
  2. Put `AGE` into the **Values** box and click **OK**.
  3. Click the filter button (which looks like a funnel), and choose `GENDER` as the field and `female` as the value.
  
The filter is only applied to the working data set and does not modify the input `customer_df`.


In [12]:
display(customer_df)

You can also filter by location. For example, the following command creates a new DataFrame that filters for customers from the USA:

In [13]:
condition = "COUNTRY_CODE = 'US'"
us_customer_df = customer_df.filter(condition)

You can pivot your analysis perspective based on aspects that are of interest to you by choosing different keys and clusters.

Create a bar chart and cluster the data.

Run the next cell and configure the graph:
1. Choose **Bar Chart** as the chart type.
2. Put `COUNTRY_CODE` into the **Keys** box.
3. Set **Aggregation** to **COUNT**.
4. Click **OK**. The chart displays the number of US customers.
5. From the **Cluster By** list, choose **GENDER**. The chart shows the number of customers by gender.

In [14]:
display(us_customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER,MEDIAN_INCOME_IN_ZIP
15170,Bradford Stokes,1759 Charack Road,,Los Angeles,90048,0,CA,US,Bradford.Stokes@1mile.jp,372-606-6996,64.0,Mr.,Baby_Boomers,U.S.,22747484,,male,72701.0
13643,Rosalia Gulledge,3492 Emeral Dreams Drive,,Los Angeles,90087,0,CA,US,Rosalia.Gulledge@spambob.com,276-332-5574,19.0,Mrs.,Gen_Z,U.S.,22747484,,female,
15069,Latonia Davidson,2108 Rivendell Drive,,Los Angeles,90099,0,CA,US,Latonia.Davidson@fact-mail.com,224-738-5956,,Mrs.,Gen_Z,U.S.,22747484,,female,
14661,Bradford Black,357 Pringle Drive,,Los Angeles,90019,0,CA,US,Bradford.Black@nifmail.jp,111-652-4559,66.0,Mr.,Baby_Boomers,U.S.,22747484,,male,42043.0
13784,Willene Felix,3199 Hilltop Drive,,Los Angeles,90041,0,CA,US,Willene.Felix@pookmail.com,653-334-1722,54.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,63770.0
14044,Jimmy Hart,784 Badger Pond Lane,,Los Angeles,90057,0,CA,US,Jimmy.Hart@meritmail.net,354-186-8372,55.0,Mr.,Baby_Boomers,U.S.,22747484,,male,28035.0
15500,Clyde Basnight,3607 Locust Street,,Los Angeles,90070,0,CA,US,Clyde.Basnight@vjp.jp,356-228-9943,72.0,Mr.,Baby_Boomers,U.S.,22747484,,male,
14971,Jayme Bellantoni,1957 Seth Street,,Los Angeles,90003,0,CA,US,Jayme.Bellantoni@csc.jp,665-332-9916,22.0,Mrs.,Gen_Z,U.S.,22747484,,female,29686.0
15266,Francis Fontanez,4505 Kyle Street,,Los Angeles,90062,0,CA,US,Francis.Fontanez@lycos.com,263-156-8968,39.0,Master.,Gen_Y,U.S.,22747484,,male,33192.0
14111,Lamar Evans,563 Pyramid Valley Road,,Los Angeles,90080,0,CA,US,Lamar.Evans@estyle.ne.jp,266-440-8764,35.0,Mr.,Gen_Y,U.S.,22747484,,male,


Now try to cluster the customers by state.

A bar chart isn't the best way to show geographic location!

[Back to Table of Contents](#toc)
### View customer density by location with a map
Maps are a much better way to view location data than other chart types. 

Visualize customer density by US state with a map.

Run the next cell and configure the graph:
1. Choose **Map** as the chart type.
2. Put `STATE` into the **Keys** box.
3. Set **Aggregation** to **COUNT**.
4. Click **OK**. The map displays the number of US customers by state.
5. From the **Renderer** list, choose **brunel**.
   > PixieDust supports three map renderers: brunel, [mapbox](https://www.mapbox.com/), and Google. Note that the Mapbox and Google renderers require an API key or access token, and supported features vary by renderer.

In [15]:
display(us_customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER,MEDIAN_INCOME_IN_ZIP
10275,Robert Ojeda,546 Lakewood Drive,,Belleville,62220,0,IL,US,Robert.M.Ojeda@mailinator.com,503-705-8957,76.0,Mr.,Baby_Boomers,FR,2.29E+14,,male,51992.0
12021,Patricia Halloran,2772 Tail Ends Road,,Minneapolis,55415,0,MN,US,Patricia.D.Halloran@spambob.com,715-601-0618,39.0,Mrs.,Gen_Y,UK,LM322694B,,female,52736.0
13771,Anabel Harris,1964 Stonepot Road,,Los Angeles,90067,0,CA,US,Anabel.Harris@fact-mail.com,452-201-8433,77.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,90972.0
12617,Susan Leventhal,109 Pickens Way,,Richmond,23222,0,VA,US,Susan.M.Leventhal@spambob.com,740-910-2749,,Mrs.,Gen_Z,IT,WIUTYN42E70F219G,,female,34234.0
13658,Ron Steadman,1814 657 Saints Alley,,Los Angeles,90001,0,CA,US,Ron.Steadman@ultrapostman.com,111-277-9401,51.0,Mr.,Gen_Z,U.S.,22747484,,male,35097.0
14432,Esmeralda Gault,440 Jett Lane,,Los Angeles,90003,0,CA,US,Esmeralda.Gault@estyle.ne.jp,134-231-1992,19.0,Mrs.,Gen_Z,U.S.,22747484,,female,29686.0
14993,Teri Follansbee,2311 Trainer Avenue,,Los Angeles,90055,0,CA,US,Teri.Follansbee@yahoo.co.jp,102-592-4520,22.0,Mrs.,Gen_Z,U.S.,22747484,,female,
15090,Rachel Byrne,3362 Lake Road,,Los Angeles,90063,0,CA,US,Rachel.Byrne@pc_run.zzn.com,460-485-8179,53.0,Mrs.,Gen_Z,U.S.,22747484,,female,38441.0
13906,Carmen Gulledge,3286 Walt Nuzum Farm Road,,Los Angeles,90035,0,CA,US,Carmen.Gulledge@lycos.com,725-432-5327,49.0,Mrs.,Gen_X,U.S.,22747484,,female,75863.0
15291,Ruth Sizemore,3263 Richison Drive,,Los Angeles,90058,0,CA,US,Ruth.Sizemore@uymail.com,603-740-6678,32.0,Mrs.,Gen_Y,U.S.,22747484,,female,16750.0


You can explore more about customers in each state by changing the aggregation method. For example, look at customer ages (average, minimum, and maximum) by state: change the aggregation function to `AVG`, `MIN`, or `MAX` and choose `AGE` as the value.
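
The three aggregations can be illustrated in plain Python with the `STATE` and `AGE` values of the ten preview rows shown above. This is a sketch of the idea only. It assumes that rows with a missing age are left out of the aggregation, which is how SQL-style aggregates usually treat nulls.

```python
from collections import defaultdict

# (STATE, AGE) pairs copied from the ten preview rows above; None marks a missing age
rows = [
    ('IL', 76), ('MN', 39), ('CA', 77), ('VA', None), ('CA', 51),
    ('CA', 19), ('CA', 22), ('CA', 53), ('CA', 49), ('CA', 32),
]

ages_by_state = defaultdict(list)
for state, age in rows:
    if age is not None:  # assumption: missing ages are ignored
        ages_by_state[state].append(age)

summary = {}
for state, ages in ages_by_state.items():
    summary[state] = {
        'AVG': sum(ages) / float(len(ages)),
        'MIN': min(ages),
        'MAX': max(ages),
    }

for state in sorted(summary):
    print(state, summary[state])
```

A state whose only rows have a missing age, such as VA in this sample, gets no entry at all in the summary.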

[Back to Table of Contents](#toc)
## Enrich demographic information with open data
You can easily combine other sources of data with your existing data. Many publicly available open data sets can be very helpful. For example, knowing the approximate income level of your customers might help you target your marketing campaigns.

Run the next cell to load [this data set](https://apsportal.ibm.com/exchange/public/entry/view/beb8c30a3f559e58716d983671b70337) from the United States Census Bureau into your notebook. The data set contains US household income statistics compiled at the zip code geography level.

In [16]:
# Load median income information for all US ZIP codes from a public source
income_df = pixiedust.sampleData('https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af')

Downloading 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af' from https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af
Downloaded 6007673 bytes
Creating pySpark DataFrame for 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af'


Now cleanse the income data set to remove the data that you don't need. Create a new DataFrame for this data:
 - The zip code, extracted from the GEOID column.
 - The column B19049e1, which contains the median household income for 2013.

In [17]:
# ------------------------------
# Helper: Extract ZIP code
# ------------------------------
def extractZIPCode(col):
    """ input: a geo code string, like '86000US01001'
        output: ZIP code, or None if the geo code does not match
    """
    # Raw string so that \d is passed to the regex engine unchanged
    m = re.match(r'^\d+US(\d\d\d\d\d)$', col)
    if m:
        return m.group(1)
    else:
        return None
    
getZIPCodeUDF = func.udf(lambda c: extractZIPCode(c), types.StringType())
income_df = income_df.select('GEOID', 'B19049e1').withColumnRenamed('B19049e1', 'MEDIAN_INCOME_IN_ZIP').withColumn("ZIP", getZIPCodeUDF(income_df['GEOID']))
income_df

DataFrame[GEOID: string, MEDIAN_INCOME_IN_ZIP: int, ZIP: string]
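
The extraction rule can be tried on a few geo codes outside Spark. In this sketch, `'86000US01001'` comes from the docstring above; the other inputs and the name `zip_from_geoid` are hypothetical and only show what the pattern rejects.

```python
import re

# One or more digits, the literal 'US', then exactly five digits
GEOID_PATTERN = re.compile(r'^\d+US(\d{5})$')

def zip_from_geoid(geoid):
    """Return the 5-digit ZIP code embedded in geoid, or None if it does not match."""
    if geoid is None:
        return None
    m = GEOID_PATTERN.match(geoid)
    return m.group(1) if m else None

for code in ['86000US01001', '86000US9021', 'US90210', None]:
    print('{0!r} -> {1!r}'.format(code, zip_from_geoid(code)))
```

The ZIP code is kept as a string, so a leading zero such as the one in `'01001'` survives the extraction.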

Now perform a left outer join on the customer data set with the income data set, using the zip code as the join condition. For the complete syntax of joins, go to the <a href="https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame" target="_blank" rel="noopener noreferrer">pyspark DataFrame documentation</a> and scroll down to the `join` syntax. 

In [18]:
us_customer_df = us_customer_df.join(income_df, us_customer_df.POSTAL_CODE == income_df.ZIP, 'left_outer').drop('GEOID').drop('ZIP')
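
A left outer join keeps every customer row, even when no income row matches the customer's ZIP code. The plain-Python sketch below illustrates this with two customers from the preview rows shown earlier, one with a matching ZIP code and one without. The helper `left_outer_join` is hypothetical and assumes at most one income row per ZIP code; a real Spark join would produce one output row per match.

```python
def left_outer_join(rows, lookup, key, new_col):
    """Add new_col to every row, using None when lookup has no entry for row[key]."""
    joined = []
    for row in rows:
        out = dict(row)
        out[new_col] = lookup.get(row[key])  # None when there is no match
        joined.append(out)
    return joined

# Two customers from the preview rows shown earlier
customers = [
    {'CUST_ID': 15170, 'POSTAL_CODE': '90048'},
    {'CUST_ID': 13643, 'POSTAL_CODE': '90087'},
]
# Median income keyed by ZIP code; 90087 is absent, as in the preview
income_by_zip = {'90048': 72701}

joined = left_outer_join(customers, income_by_zip, 'POSTAL_CODE', 'MEDIAN_INCOME_IN_ZIP')
for row in joined:
    print(row)
```

This is why some customers in the joined DataFrame have an empty `MEDIAN_INCOME_IN_ZIP` value: they remain in the result, but no income figure was found for their postal code.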

Now you can visualize the income distribution of your customers by zip code.

Run the next cell and configure the graph:
1. Choose **Histogram** as the chart type.
2. Put `MEDIAN_INCOME_IN_ZIP` into the **Values** box and click **OK**.

In [19]:
display(us_customer_df)

The majority of your customers live in zip codes where the median income is around 40,000 USD. 

[Back to Table of Contents](#toc)


<a id="summary"></a>
## Summary and next steps

You successfully completed this notebook!  

Check out other notebooks in this series: 
 - Localcart scenario one: Dynamic data analysis and visualization
 - Localcart scenario three: Build a product recommendation engine
 - Localcart scenario four: Build a revenue dashboard using PixieApps

Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License.