# Static data analysis using Python, Apache Spark, and PixieDust
***

In this notebook, you'll first analyze customer demographics, such as age, gender, income, and location. Then you'll combine that data with sales data to examine trends for product categories, transaction types, and product popularity. You'll load data from a previous notebook as well as from a public open data set, cleanse, shape, and enrich the data, and then visualize it with the PixieDust library. Don't worry! PixieDust graphs don't require coding. By the end of the notebook, you'll understand how to combine data to gain insights about which customers you might target to increase sales.

<img src="https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/images/part_1.png"></img>

This notebook runs on Python 2 with Spark 2.1 and PixieDust 1.1.10.

<a id="toc"></a>
## Table of contents

#### [Setup](#Setup)
[Load data into the notebook](#Load-data-into-the-notebook)
#### [Part 1. Explore customer demographics](#part1)
[Prepare the customer data set](#Prepare-the-customer-data-set)<br>
[Visualize customer demographics and locations](#Visualize-customer-demographics-and-locations)<br>
[Enrich demographic information with open data](#Enrich-demographic-information-with-open-data)<br>   

#### [Summary and next steps](#summary)

## Setup
You need to import libraries and load the customer data into this notebook.

- Install the most current packages so we can take advantage of the latest features.

In [1]:
# run this cell
# jinja2 version 2.10 is required
#! pip install jinja2 --user --upgrade
# pixiedust version 1.1.7.1 (or above) is required
#! pip install pixiedust --user --upgrade
#!pip install -U --no-deps bokeh

> **If any package was updated, restart the kernel and reload the browser page.**

Import the necessary libraries:

In [1]:
import pixiedust
import pyspark.sql.functions as func
import pyspark.sql.types as types
import re
import json
import os
import requests  

Pixiedust database opened successfully


### Load data into the notebook

The data file contains both the customer demographic data that you'll analyze in Part 1, and the sales transaction data for Part 2.

In [2]:
raw_df = pixiedust.sampleData('https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv')

Downloading 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv' from https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv
Downloaded 5648773 bytes
Creating pySpark DataFrame for 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://raw.githubusercontent.com/IBMCodeLondon/localcart-workshop/master/data/customers_orders1_opt.csv'


[Back to Table of Contents](#toc)
<a id="part1"></a>
# Part 1. Explore customer demographics 
In this part of the notebook, you'll prepare the customer data and then start learning about your customers by creating multiple charts and maps. 

## Prepare the customer data set
You'll create a new DataFrame with just the data you need and then cleanse and enrich the data.

Extract the columns that you want, remove duplicate customers, and add a column for aggregations:

In [3]:
# Extract the customer information from the data set
# CUSTNAME: string, GenderCode: string, ADDRESS1: string, CITY: string, STATE: string, COUNTRY_CODE: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, ADDRESS2: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, CREDITCARD_TYPE: string, LOCALITY: string, SALESMAN_ID: string, NATIONALITY: string, NATIONAL_ID: string, CREDITCARD_NUMBER: bigint, DRIVER_LICENSE: string, CUST_ID: int,
customer_df = raw_df.select("CUST_ID", 
                            "CUSTNAME", 
                            "ADDRESS1", 
                            "ADDRESS2", 
                            "CITY", 
                            "POSTAL_CODE", 
                            "POSTAL_CODE_PLUS4", 
                            "STATE", 
                            "COUNTRY_CODE", 
                            "EMAIL_ADDRESS", 
                            "PHONE_NUMBER",
                            "AGE",
                            "GenderCode",
                            "GENERATION",
                            "NATIONALITY", 
                            "NATIONAL_ID", 
                            "DRIVER_LICENSE").dropDuplicates()

customer_df

DataFrame[CUST_ID: int, CUSTNAME: string, ADDRESS1: string, ADDRESS2: string, CITY: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, STATE: string, COUNTRY_CODE: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, AGE: string, GenderCode: string, GENERATION: string, NATIONALITY: string, NATIONAL_ID: string, DRIVER_LICENSE: string]

Notice that the data type of the AGE column is currently a string. Convert the AGE column to a numeric data type so you can run calculations on customer age.

In [4]:
# ---------------------------------------
# Cleanse age (enforce numeric data type) 
# ---------------------------------------

def getNumericVal(col):
    """
    input: the raw AGE value (a string such as '33' or 'age-33', or None)
    output: the numeric value represented by col or None
    """
    if col is None:
        # missing ages stay missing instead of raising a TypeError
        return None
    try:
        return int(col)
    except ValueError:
        # handle values of the form 'age-33'
        match = re.match(r'^age-(\d+)$', col)
        if match:
            return int(match.group(1))
        return None

toNumericValUDF = func.udf(lambda c: getNumericVal(c), types.IntegerType())
customer_df = customer_df.withColumn("AGE", toNumericValUDF(customer_df["AGE"]))
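
The cleansing rule above can be checked quickly outside Spark. The following is a minimal sketch in plain Python, assuming the same two accepted formats (`33` and `age-33`). The name `parse_age` and the sample values are hypothetical and used only for illustration.

```python
import re

def parse_age(value):
    """Return the age as an int, or None if it cannot be parsed.

    parse_age is a hypothetical stand-alone helper used only in this sketch.
    """
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        # Accept values written as 'age-33'
        match = re.match(r'^age-(\d+)$', value)
        return int(match.group(1)) if match else None

# Made-up raw values covering the clean, prefixed, and unparseable cases
samples = ["33", "age-33", "thirty", "", None]
cleaned = [parse_age(v) for v in samples]
print(cleaned)
```

Values that match neither format become `None`, which Spark stores as a null in the integer `AGE` column.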

The GenderCode column contains salutations instead of gender values. Derive the gender of each customer from the salutation and store it in a new GENDER column.

In [5]:
# ------------------------------
# Derive gender from salutation
# ------------------------------
def deriveGender(col):
    """ input: the salutation stored in GenderCode, such as 'Mr.' or 'Mrs.'
        output: "male", "female" or "unknown"
    """
    if col in ['Mr.', 'Master.']:
        return 'male'
    elif col in ['Mrs.', 'Miss.']:
        return 'female'
    else:
        return 'unknown'
    
deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())
customer_df = customer_df.withColumn("GENDER", deriveGenderUDF(customer_df["GenderCode"]))
customer_df.cache()

DataFrame[CUST_ID: int, CUSTNAME: string, ADDRESS1: string, ADDRESS2: string, CITY: string, POSTAL_CODE: string, POSTAL_CODE_PLUS4: int, STATE: string, COUNTRY_CODE: string, EMAIL_ADDRESS: string, PHONE_NUMBER: string, AGE: int, GenderCode: string, GENERATION: string, NATIONALITY: string, NATIONAL_ID: string, DRIVER_LICENSE: string, GENDER: string]

## Explore the customer data set

You can quickly explore data sets using PixieDust's data set explorer. Invoke the `display()` command and click the table icon to review the schema and preview the data. Customize the options to display only a subset of the fields or rows or apply a filter (by clicking the funnel icon).

In [6]:
display(customer_df)

[Back to Table of Contents](#toc)
## Visualize customer demographics and locations

Now you're ready to explore the customer base. Using simple charts, you can quickly see these characteristics:
 * Customer demographics (gender and age)
 * Customer locations (city, state, and country)

You'll create charts with the PixieDust library:

 - [View customers by gender in a pie chart](#View-customers-by-gender-in-a-pie-chart)
 - [View customers by generation in a bar chart](#View-customers-by-generation-in-a-bar-chart)
 - [View customers by age in a histogram chart](#View-customers-by-age-in-a-histogram-chart)
 - [View specific information with a filter function](#View-specific-information-with-a-filter-function)
 - [View customer density by location with a map](#View-customer-density-by-location-with-a-map)

### View customers by gender in a pie chart

Run the `display()` command and then configure the graph to show the percentages of male and female customers:

1. Run the next cell. The PixieDust interactive widget appears.  
1. Click the chart button and choose **Pie Chart**. The chart options tool appears.
1. In the chart options, drag `GENDER` into the **Keys** box. 
1. In the **Aggregation** field, choose **COUNT**. 
1. Click **OK**. The pie chart appears.

If you want to make further changes, click **Options** to return to the chart options tool.

In [7]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
12005,James Hammond,2440 Alpaca Way,,Mignanego,16018,0,GE,IT,James.D.Hammond@mailinator.com,0310 3414620,30.0,Mr.,Gen_Y,U.S.,8900001,,male
11821,Robert Hunt,4013 Bombardier Way,,Lunata,55010,0,LU,IT,Robert.M.Hunt@spambob.com,0324 8012754,61.0,Mr.,Baby_Boomers,ES,1037902G,,male
11917,Loretta Batton,1189 Wyatt Street,,Maserno,41055,0,MO,IT,Loretta.P.Batton@trashymail.com,0391 2591255,57.0,Mrs.,Gen_Z,IT,EOIKOA28D55L860I,,female
15480,Crystal Blum,1762 May Street,,Los Angeles,90013,0,CA,US,Crystal.Blum@mail.goo.ne.jp,460-618-7343,32.0,Mrs.,Gen_Y,U.S.,22747484,,female
13992,Caitlin Ford,2998 Wolf Pen Road,,Los Angeles,90189,0,CA,US,Caitlin.Ford@popj.com,424-170-6876,66.0,Mrs.,Baby_Boomers,U.S.,22747484,,female
12742,Jason Bien,1537 Fleming Street,,Saldana,34100,0,,ES,Jason.C.Bien@trashymail.com,91-102-2379,63.0,Mr.,Baby_Boomers,UK,PA781254,,male
11796,Robin Porter,1577 Sunburst Drive,,Los Banos,93635,0,CA,US,Robin.R.Porter@pookmail.com,501-620-1620,,Master.,Gen_Z,FR,2.82E+14,,male
14971,Jayme Bellantoni,1957 Seth Street,,Los Angeles,90003,0,CA,US,Jayme.Bellantoni@csc.jp,665-332-9916,22.0,Mrs.,Gen_Z,U.S.,22747484,,female
10947,Nina Butler,1663 Carter Street,,Dodge City,67801,0,KS,US,Nina.J.Butler@mailinator.com,402-804-8282,42.0,Mrs.,Gen_X,IT,DYLBYI81P63A330L,,female
14990,Janette Oram,3410 Calvin Street,,Los Angeles,90061,0,CA,US,Janette.Oram@spambob.com,644-695-5450,30.0,Mrs.,Gen_Y,U.S.,22747484,,female
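
The pie chart configured above counts customers per `GENDER` value and shows each count as a share of the total. As a rough sketch of that calculation in plain Python (not PixieDust's own implementation), using made-up sample values:

```python
from collections import Counter

# Hypothetical GENDER values, one per customer
genders = ["male", "female", "female", "male",
           "female", "unknown", "female", "female"]

counts = Counter(genders)              # COUNT aggregation keyed by GENDER
total = sum(counts.values())
shares = {gender: 100.0 * n / total for gender, n in counts.items()}

for gender in sorted(shares):
    print("{}: {:.1f}%".format(gender, shares[gender]))
```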


[Back to Table of Contents](#toc)
### View customers by generation in a bar chart
Look at how many customers you have per "generation."

Run the next cell and configure the graph: 
1. Choose **Bar Chart** as the chart type.
2. Put `GENERATION` into the **Keys** box.
3. Set **Aggregation** to **COUNT**.
4. Click **OK**.
5. Change the **Renderer** at the top right of the chart to explore different visualizations.

In [8]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
13060,Shirley Christensen,4104 Huntz Lane,,Stribugliano,58040,0,GR,IT,Shirley.E.Christensen@pookmail.com,0396 9717121,54.0,Mrs.,Baby_Boomers,CA,109544130,,female
11354,Dave Iddings,1097 Mesa Drive,,Harrisburg,17109,0,PA,US,Dave.A.Iddings@dodgeit.com,713-254-3302,69.0,Mr.,Baby_Boomers,IT,KCEPNZ39E27G746D,,male
15179,Clair West,4128 Trymore Road,,Los Angeles,90070,0,CA,US,Clair.West@estyle.ne.jp,524-386-7155,57.0,Miss.,Baby_Boomers,U.S.,22747484,,female
13026,Lewis Huffman,411 Tanglewood Road,,St. Kassian,39030,0,BZ,IT,Lewis.S.Huffman@dodgeit.com,0334 0345737,64.0,Mr.,Baby_Boomers,CA,673327888,,male
13947,Paula Rogers,447 Pallet Street,,Los Angeles,90019,0,CA,US,Paula.Rogers@sailormoon.com,606-652-1883,21.0,Mrs.,Gen_Z,U.S.,22747484,,female
10724,Wayne Cohen,2641 Queens Lane,,Chicago,60606,0,IL,US,Wayne.J.Cohen@trashymail.com,864-266-1322,78.0,Mr.,Baby_Boomers,FR,2.03E+14,,male
14349,Belinda Bilbo,477 University Drive,,Los Angeles,90056,0,CA,US,Belinda.Bilbo@fact-mail.com,720-741-9900,43.0,Mrs.,Gen_X,U.S.,22747484,,female
14520,Ernest Cassidy,2416 Bubby Drive,,Los Angeles,90049,0,CA,US,Ernest.Cassidy@mini-blog.net,600-152-2314,38.0,Mr.,Gen_Z,U.S.,22747484,,male
14158,Lola Klahn,290 Moonlight Drive,,Los Angeles,90014,0,CA,US,Lola.Klahn@vjp.jp,356-105-9304,24.0,Mrs.,Gen_Y,U.S.,22747484,,female
10693,Doris Lopez,4810 Layman Court,,Chapel Hill,27514,0,NC,US,Doris.E.Lopez@trashymail.com,212-420-8607,45.0,Mrs.,Gen_X,ES,9841086F,,female


You can use clustering to group customers, for example by geographic location. To group generations by country, select `COUNTRY_CODE` from the **Cluster by** list. 
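
Clustering amounts to counting by two fields instead of one: the key and the cluster field. A minimal plain-Python sketch of that idea, assuming hypothetical `(GENERATION, COUNTRY_CODE)` pairs rather than the real data set:

```python
from collections import Counter

# Hypothetical (GENERATION, COUNTRY_CODE) pairs, one per customer
rows = [
    ("Baby_Boomers", "US"), ("Baby_Boomers", "IT"), ("Gen_Y", "US"),
    ("Gen_Y", "US"), ("Gen_X", "US"), ("Gen_Z", "IT"), ("Baby_Boomers", "US"),
]

# Key = GENERATION, cluster = COUNTRY_CODE, aggregation = COUNT
clustered = Counter(rows)

for (generation, country) in sorted(clustered):
    print("{} / {}: {}".format(generation, country, clustered[(generation, country)]))
```

Each distinct pair becomes one bar, and bars that share a generation are drawn side by side.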

[Back to Table of Contents](#toc)
### View customers by age in a histogram chart
A generation is a broad age range. You can look at a smaller age range with a histogram chart. A histogram is like a bar chart except each bar represents a range of numbers, called a bin. You can customize the size of the age range by adjusting the bin size. The more bins you specify, the smaller the age range.

Run the next cell and configure the graph:
1. Choose **Histogram** as the chart type. 
2. Put `AGE` into the **Values** box and click **OK**.
3. Use the **Bin count** slider to specify the number of bins. Try starting with 40.

In [9]:
display(customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER
12021,Patricia Halloran,2772 Tail Ends Road,,Minneapolis,55415,0,MN,US,Patricia.D.Halloran@spambob.com,715-601-0618,39.0,Mrs.,Gen_Y,UK,LM322694B,,female
13090,Virginia Lambert,1273 Garfield Road,,Tahmoor,2573,0,NSW,AU,Virginia.G.Lambert@mailinator.com,(07) 3707 5925,29.0,Mrs.,Gen_Y,CA,621936657,,female
10512,Robert Moon,1614 Flanigan Oaks Drive,,Callicoon,12723,0,NY,US,Robert.C.Moon@pookmail.com,985-407-2085,65.0,Mr.,Baby_Boomers,ES,8059361T,,male
14557,Argentina Beaudry,2212 County Line Road,,Los Angeles,90013,0,CA,US,Argentina.Beaudry@docomo.ne.jp,250-604-7894,39.0,Mrs.,Gen_Y,U.S.,22747484,,female
14613,Michelle Luna,2323 High Meadow Lane,,Los Angeles,90084,0,CA,US,Michelle.Luna@kobej.zzn.com,625-654-9433,38.0,Mrs.,Gen_Y,U.S.,22747484,,female
14554,Angeline Moe,4978 Cityview Drive,,Los Angeles,90024,0,CA,US,Angeline.Moe@nifmail.jp,466-272-4519,28.0,Mrs.,Gen_Y,U.S.,22747484,,female
13728,Rod Falcon,1951 Lake Floyd Circle,,Los Angeles,90040,0,CA,US,Rod.Falcon@fact-mail.com,641-132-8160,68.0,Mr.,Baby_Boomers,U.S.,22747484,,male
15308,Armando Holgate,4402 Walnut Drive,,Los Angeles,90093,0,CA,US,Armando.Holgate@excite.co.jp,331-633-1442,26.0,Mr.,Gen_Y,U.S.,22747484,,male
13037,Dion Winston,561 Ashmor Drive,,Steinfeld,5356,0,SA,AU,Dion.M.Winston@mailinator.com,(02) 4999 0258,64.0,Mr.,Baby_Boomers,FR,2.55E+14,,male
12206,Deborah Adams,209 Pursglove Court,,Nevers,58000,0,,FR,Deborah.N.Adams@spambob.com,05.12.85.51.67,72.0,Mrs.,Baby_Boomers,IT,CBWGBL45R54B656N,,female
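
The relationship between the bin count and the width of each age range can be sketched in plain Python. This sketch assumes equal-width bins spanning the minimum to the maximum age. PixieDust's renderers may compute bin edges differently, and the `histogram` helper and the ages are made up for illustration.

```python
def histogram(values, bin_count):
    """Count values in bin_count equal-width bins spanning min..max.

    Assumes values contains at least two distinct numbers.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bin_count)
    counts = [0] * bin_count
    for v in values:
        # The maximum value is placed in the last bin
        index = min(int((v - lo) / width), bin_count - 1)
        counts[index] += 1
    return width, counts

# Hypothetical ages; rows with a missing AGE are assumed to be dropped first
ages = [20, 22, 26, 30, 32, 38, 39, 45, 54, 57, 60]

for bins in (2, 4):
    width, counts = histogram(ages, bins)
    print("bins={} width={} counts={}".format(bins, width, counts))
```

Doubling the number of bins halves the width of each age range, which is why a higher bin count shows finer detail.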


[Back to Table of Contents](#toc)
### View specific information with a filter function

You can filter records to restrict analysis by using the [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) `filter()` function.

If you want to view the age distribution for a specific generation, uncomment the desired filter condition and run the next cell:

In [10]:
# Data subsetting: display age distribution for a specific generation
# (Chart type: histogram, Chart Options > Values: AGE)
# to change the filter condition remove the # sign 
condition = "GENERATION = 'Baby_Boomers'"
#condition = "GENERATION = 'Gen_X'"
#condition = "GENERATION = 'Gen_Y'"
#condition = "GENERATION = 'Gen_Z'"
display(customer_df.filter(condition))


PixieDust supports basic filtering to make it easy to analyze data subsets. For example, to view the age distribution for a specific gender, configure the chart as follows:

  1. Choose **Histogram** as the chart type.
  2. Put `AGE` into the **Values** box and click **OK**.
  3. Click the filter button (the funnel icon), and choose `GENDER` as the field and `female` as the value.
  
The filter is only applied to the working data set and does not modify the input `customer_df`.


In [11]:
display(customer_df)

You can also filter by location. For example, the following command creates a new DataFrame that filters for customers from the USA:

In [12]:
condition = "COUNTRY_CODE = 'US'"
us_customer_df = customer_df.filter(condition)

You can pivot your analysis perspective based on aspects that are of interest to you by choosing different keys and clusters.

Create a bar chart and cluster the data.

Run the next cell and configure the graph:
1. Choose **Bar Chart** as the chart type.
2. Put `COUNTRY_CODE` into the **Keys** box.
3. Set **Aggregation** to **COUNT**.
4. Click **OK**. The chart displays the number of US customers.
5. From the **Cluster By** list, choose **GENDER**. The chart shows the number of customers by gender.

In [13]:
display(us_customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER,MEDIAN_INCOME_IN_ZIP
15362,Tonya Kraft,150 Horizon Circle,,Los Angeles,90041,0,CA,US,Tonya.Kraft@dodgeit.com,317-735-6797,54.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,63770.0
14236,Opal Pardo,2778 Cooks Mine Road,,Los Angeles,90046,0,CA,US,Opal.Pardo@nifmail.jp,214-260-9700,53.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,52641.0
14596,Liz Mendoza,1963 Maple Street,,Los Angeles,90038,0,CA,US,Liz.Mendoza@smoug.net,317-454-4146,38.0,Mrs.,Gen_Y,U.S.,22747484,,female,35144.0
10235,Mary Bates,1646 Tenmile,,Baltimore,21201,0,MD,US,Mary.T.Bates@pookmail.com,503-337-6080,,Mrs.,Gen_Z,ES,2167155A,,female,27465.0
10013,Norbert Cantu,2124 Henry Ford Avenue,,Ada,49301,0,MI,US,Norbert.T.Cantu@spambob.com,614-355-2446,55.0,Mr.,Baby_Boomers,CA,272701863,,male,118761.0
12363,David Rivas,3927 Masonic Drive,,Panola,30058,0,GA,US,David.B.Rivas@dodgeit.com,817-373-6615,34.0,Mr.,Gen_Y,IT,EXXRBW20A10C810X,,male,45792.0
10985,Grace Alfaro,2284 Whitetail Lane,,Eagleville,19403,0,PA,US,Grace.J.Alfaro@mailinator.com,562-802-9460,,Mrs.,Gen_Z,ES,8857379X,,female,76178.0
14613,Michelle Luna,2323 High Meadow Lane,,Los Angeles,90084,0,CA,US,Michelle.Luna@kobej.zzn.com,625-654-9433,38.0,Mrs.,Gen_Y,U.S.,22747484,,female,
14811,Donna Camarillo,3486 Cardinal Lane,,Los Angeles,90046,0,CA,US,Donna.Camarillo@smoug.net,350-260-3539,65.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,52641.0
14044,Jimmy Hart,784 Badger Pond Lane,,Los Angeles,90057,0,CA,US,Jimmy.Hart@meritmail.net,354-186-8372,55.0,Mr.,Baby_Boomers,U.S.,22747484,,male,28035.0


Now try to cluster the customers by state.

A bar chart isn't the best way to show geographic location!

[Back to Table of Contents](#toc)
### View customer density by location with a map
Maps are a much better way to view location data than other chart types. 

Visualize customer density by US state with a map.

Run the next cell and configure the graph:
1. Choose **Map** as the chart type.
2. Put `STATE` into the **Keys** box.
3. Set **Aggregation** to **COUNT**.
4. Click **OK**. The map displays the number of US customers.
5. From the **Renderer** list, choose **brunel**.
   > PixieDust supports three map renderers: brunel, [mapbox](https://www.mapbox.com/), and Google. Note that the Mapbox and Google renderers require an API key or access token, and supported features vary by renderer.

In [14]:
display(us_customer_df)

CUST_ID,CUSTNAME,ADDRESS1,ADDRESS2,CITY,POSTAL_CODE,POSTAL_CODE_PLUS4,STATE,COUNTRY_CODE,EMAIL_ADDRESS,PHONE_NUMBER,AGE,GenderCode,GENERATION,NATIONALITY,NATIONAL_ID,DRIVER_LICENSE,GENDER,MEDIAN_INCOME_IN_ZIP
13625,Patti Miller,2183 Scenic Way,,Los Angeles,90017,0,CA,US,Patti.Miller@smoug.net,312-435-5128,57.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,21030.0
12123,Joseph Dyke,364 Buena Vista Avenue,,Morrisville,27560,0,NC,US,Joseph.E.Dyke@mailinator.com,615-780-1112,26.0,Mr.,Gen_Y,IT,RNSVAE87H57E507C,,male,77631.0
10013,Norbert Cantu,2124 Henry Ford Avenue,,Ada,49301,0,MI,US,Norbert.T.Cantu@spambob.com,614-355-2446,55.0,Mr.,Baby_Boomers,CA,272701863,,male,118761.0
13771,Anabel Harris,1964 Stonepot Road,,Los Angeles,90067,0,CA,US,Anabel.Harris@fact-mail.com,452-201-8433,77.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,90972.0
15500,Clyde Basnight,3607 Locust Street,,Los Angeles,90070,0,CA,US,Clyde.Basnight@vjp.jp,356-228-9943,72.0,Mr.,Baby_Boomers,U.S.,22747484,,male,
13658,Ron Steadman,1814 657 Saints Alley,,Los Angeles,90001,0,CA,US,Ron.Steadman@ultrapostman.com,111-277-9401,51.0,Mr.,Gen_Z,U.S.,22747484,,male,35097.0
12212,Steven Parker,3881 Colonial Drive,,New Brighton,55112,0,MN,US,Steven.J.Parker@mailinator.com,505-299-1706,57.0,Mr.,Baby_Boomers,ES,7589661Y,,male,64763.0
13815,Teresa Pless,731 Science Center Drive,,Los Angeles,90065,0,CA,US,Teresa.Pless@mailinator.com,661-242-6244,65.0,Mrs.,Baby_Boomers,U.S.,22747484,,female,53635.0
15291,Ruth Sizemore,3263 Richison Drive,,Los Angeles,90058,0,CA,US,Ruth.Sizemore@uymail.com,603-740-6678,32.0,Mrs.,Gen_Y,U.S.,22747484,,female,16750.0
14598,Gracie Bruner,4292 Lighthouse Drive,,Los Angeles,90058,0,CA,US,Gracie.Bruner@smoug.net,632-517-7402,45.0,Mrs.,Gen_X,U.S.,22747484,,female,16750.0


You can explore more about the customers in each state by changing the aggregation method. For example, to look at customer ages (average, minimum, and maximum) by state, change the aggregation function to `AVG`, `MIN`, or `MAX` and choose `AGE` as the value.
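
As a rough illustration of what these aggregations compute, the following plain-Python sketch groups ages by state and derives the three statistics. The `(STATE, AGE)` pairs are hypothetical, and missing ages are skipped on the assumption that the chart ignores nulls the way Spark SQL aggregates do.

```python
from collections import defaultdict

# Hypothetical (STATE, AGE) pairs; None marks a missing age
rows = [("CA", 57), ("CA", 32), ("CA", None), ("NC", 26), ("MN", 57), ("CA", 45)]

ages_by_state = defaultdict(list)
for state, age in rows:
    if age is not None:              # skip missing ages
        ages_by_state[state].append(age)

summary = {
    state: {"AVG": sum(a) / float(len(a)), "MIN": min(a), "MAX": max(a)}
    for state, a in ages_by_state.items()
}

for state in sorted(summary):
    print("{}: {}".format(state, summary[state]))
```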

[Back to Table of Contents](#toc)
## Enrich demographic information with open data
You can easily combine other sources of data with your existing data. There are many publicly available open data sets that can be very helpful. For example, knowing the approximate income level of your customers might help you target your marketing campaigns.

Run the next cell to load [this data set](https://apsportal.ibm.com/exchange/public/entry/view/beb8c30a3f559e58716d983671b70337) from the United States Census Bureau into your notebook. The data set contains US household income statistics compiled at the zip code geography level.

In [15]:
# Load median income information for all US ZIP codes from a public source
income_df = pixiedust.sampleData('https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af')

Downloading 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af' from https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af
Downloaded 6007673 bytes
Creating pySpark DataFrame for 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af'. Please wait...
Loading file using 'SparkSession'
Successfully created pySpark DataFrame for 'https://apsportal.ibm.com/exchange-api/v1/entries/beb8c30a3f559e58716d983671b70337/data?accessKey=1c0b5b6d465fefec1ab529fde04997af'


Now cleanse the income data set to remove the data that you don't need. Create a new DataFrame for this data:
 - The zip code, extracted from the GEOID column.
 - The column B19049e1, which contains the median household income for 2013.

In [16]:
# ------------------------------
# Helper: Extract ZIP code
# ------------------------------
def extractZIPCode(col):
    """ input: a geo code string, like '86000US01001' (or None)
        output: the five-digit ZIP code, or None if col does not match
    """
    if col is None:
        return None
    m = re.match(r'^\d+US(\d\d\d\d\d)$', col)
    if m:
        return m.group(1)
    else:
        return None
    
getZIPCodeUDF = func.udf(lambda c: extractZIPCode(c), types.StringType())
income_df = income_df.select('GEOID', 'B19049e1').withColumnRenamed('B19049e1', 'MEDIAN_INCOME_IN_ZIP').withColumn("ZIP", getZIPCodeUDF(income_df['GEOID']))
income_df

DataFrame[GEOID: string, MEDIAN_INCOME_IN_ZIP: int, ZIP: string]
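
The extraction pattern can be tried on a few geo codes without Spark. This is a small sketch in plain Python. The helper name `extract_zip` is hypothetical, and the only input taken from this notebook is the example `86000US01001`.

```python
import re

def extract_zip(geoid):
    """Return the 5-digit ZIP code from a geo code such as '86000US01001'."""
    if geoid is None:
        return None
    match = re.match(r'^\d+US(\d{5})$', geoid)
    return match.group(1) if match else None

print(extract_zip("86000US01001"))   # the leading zero is kept
print(extract_zip("86000US9021"))    # only four digits after 'US': no match
print(extract_zip(None))
```

Returning the ZIP code as a string keeps leading zeros and matches the string type of `POSTAL_CODE` in the customer data.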

Now perform a left outer join on the customer data set with the income data set, using the zip code as the join condition. For the complete syntax of joins, go to the <a href="https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame" target="_blank" rel="noopener noreferrer">pyspark DataFrame documentation</a> and scroll down to the `join` syntax. 

In [17]:
us_customer_df = us_customer_df.join(income_df, us_customer_df.POSTAL_CODE == income_df.ZIP, 'left_outer').drop('GEOID').drop('ZIP')
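
A left outer join keeps every customer row, even when the income data has no entry for the customer's zip code. The sketch below shows those semantics in plain Python with a dictionary lookup. The customer IDs are made up, and the lookup assumes each ZIP code appears at most once in the income data.

```python
# Hypothetical customers as (CUST_ID, POSTAL_CODE) and a ZIP -> median income lookup
customers = [(1, "90046"), (2, "90084"), (3, "21201")]
income_by_zip = {"90046": 52641, "21201": 27465}

# Left outer join: every customer is kept; an unmatched ZIP code yields None
joined = [
    (cust_id, postal_code, income_by_zip.get(postal_code))
    for cust_id, postal_code in customers
]

for row in joined:
    print(row)
```

Customers without a match end up with an empty `MEDIAN_INCOME_IN_ZIP`, so they are still counted in other charts but contribute no value to an income histogram.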

Now you can visualize the distribution of median income across your customers' zip codes.

Run the next cell and configure the graph:
1. Choose **Histogram** as the chart type.
2. Put `MEDIAN_INCOME_IN_ZIP` into the **Values** box and click **OK**.

In [18]:
display(us_customer_df)

The majority of your customers live in zip codes where the median income is around 40,000 USD. 

[Back to Table of Contents](#toc)


Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License.