# Data Exploration for Police Shooting Data

**[Synopsis] This notebook explores the US unemployment dataset used by the Police Shooting Dashboard.**

**Reference**:
* [US Unemployment API](https://api.careeronestop.org/api-explorer/home/index/UnEmployment_GetUnEmploymentType)

Using the following endpoint:
* /v1/unemployment/{userId}/{location}/{unemploymentType}

**Example**:
* https://api.careeronestop.org/v1/unemployment/{userId}/CA/county
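The endpoint pattern above can be filled in programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `build_unemployment_url` and the user ID `my-user-id` are assumptions for illustration, and the function only builds the URL string without sending a request.

```python
from urllib.parse import quote

BASE_URL = "https://api.careeronestop.org"  # host taken from the example above


def build_unemployment_url(user_id: str, location: str, unemployment_type: str) -> str:
    """Fill the /v1/unemployment/{userId}/{location}/{unemploymentType} pattern."""
    # Percent-encode each path segment so spaces or slashes cannot break the path.
    segments = [quote(str(s), safe="") for s in (user_id, location, unemployment_type)]
    return f"{BASE_URL}/v1/unemployment/" + "/".join(segments)


# "my-user-id" is a placeholder, not a real credential.
print(build_unemployment_url("my-user-id", "CA", "county"))
```

Encoding each segment separately is a defensive choice; for simple values such as `CA` and `county` the result is the same as plain string formatting.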

*****

In [1]:
import requests
import configparser
config = configparser.ConfigParser()
config.read('config.ini')

['config.ini']
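`config.read` returns the list of files it managed to parse and silently skips any it cannot find, so the `['config.ini']` output confirms the file was loaded. The sketch below shows one possible layout for that file. The section and key names are the ones this notebook reads; every value is a placeholder, and the `required` check is an added assumption rather than something the notebook does.

```python
import configparser

# Hypothetical config.ini contents: section and key names match what the
# notebook reads, all values are placeholders.
SAMPLE_CONFIG = """
[pathways]
usCities = data/uscities.csv
usDemographics = data/us_demographics.csv

[unemploymentAPI]
unemploymentUserID = my-user-id
unemploymentAPIKey = my-api-token
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Collect any key the notebook needs that is absent from the config.
required = {
    "pathways": ["usCities", "usDemographics"],
    "unemploymentAPI": ["unemploymentUserID", "unemploymentAPIKey"],
}
missing = [
    f"{section}.{key}"
    for section, keys in required.items()
    for key in keys
    if not sample.has_option(section, key)
]
print(missing)
```

An empty `missing` list means every key used later in the notebook is present.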

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType, BooleanType, FloatType

spark = SparkSession.builder.master('local[*]').appName('DataExploration').getOrCreate()

### US Cities Dataset

In [3]:
# Creating Dataframe and Temp View
usCitiesDF = spark.read.option('header', 'True').option('inferSchema', 'true').csv(config['pathways']['usCities'])
usCitiesDF.createOrReplaceTempView('usCities')

# Preview the relevant columns
usCitiesDF.select('state_id', 'state_name', 'city', 'county_name').show()

+--------+--------------------+-------------+--------------------+
|state_id|          state_name|         city|         county_name|
+--------+--------------------+-------------+--------------------+
|      NY|            New York|     New York|            New York|
|      CA|          California|  Los Angeles|         Los Angeles|
|      IL|            Illinois|      Chicago|                Cook|
|      FL|             Florida|        Miami|          Miami-Dade|
|      TX|               Texas|       Dallas|              Dallas|
|      PA|        Pennsylvania| Philadelphia|        Philadelphia|
|      TX|               Texas|      Houston|              Harris|
|      GA|             Georgia|      Atlanta|              Fulton|
|      DC|District of Columbia|   Washington|District of Columbia|
|      MA|       Massachusetts|       Boston|             Suffolk|
|      AZ|             Arizona|      Phoenix|            Maricopa|
|      WA|          Washington|      Seattle|                K

## Testing the Unemployment API

In [4]:
headersAuth = {
    'Authorization': 'Bearer '+ config['unemploymentAPI']['unemploymentAPIKey']
}

### Preview of Unemployment Dataset

In [5]:
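# Request county-level unemployment data for California
endPointTemplate = 'https://api.careeronestop.org/v1/unemployment/{}/{}/{}'
url = endPointTemplate.format(config['unemploymentAPI']['unemploymentUserID'], 'CA', 'county')
response = requests.get(url, headers=headersAuth, verify=True)
response.raise_for_status()  # stop here on an HTTP error status
# Keep only the list of county records from the JSON payload
response = response.json()['CountyList']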
response[:3]

[{'AreaName': 'Alameda County',
  'UnEmpCount': '52111',
  'UnEmpRate': '6.5',
  'AreaType': '04',
  'Stfips': '06',
  'AreaID': '000001'},
 {'AreaName': 'Alpine County',
  'UnEmpCount': '46',
  'UnEmpRate': '7.6',
  'AreaType': '04',
  'Stfips': '06',
  'AreaID': '000003'},
 {'AreaName': 'Amador County',
  'UnEmpCount': '1074',
  'UnEmpRate': '7.4',
  'AreaType': '04',
  'Stfips': '06',
  'AreaID': '000005'}]
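Every field in the preview above is a string, including the count and the rate. String values compare lexicographically (for example `'12' < '6.5'`), so numeric conversion is worth considering before sorting or aggregating. The following is a minimal sketch in plain Python; the helper name `to_numeric` is an assumption, and the records are copied from the preview.

```python
# Records copied from the preview above; the API returns every field as a string.
sample_records = [
    {"AreaName": "Alameda County", "UnEmpCount": "52111", "UnEmpRate": "6.5",
     "AreaType": "04", "Stfips": "06", "AreaID": "000001"},
    {"AreaName": "Alpine County", "UnEmpCount": "46", "UnEmpRate": "7.6",
     "AreaType": "04", "Stfips": "06", "AreaID": "000003"},
]


def to_numeric(record: dict) -> dict:
    """Return a copy with the count as int and the rate as float.

    Identifier-like fields (AreaID, Stfips, AreaType) stay strings so that
    their leading zeros are preserved.
    """
    converted = dict(record)
    converted["UnEmpCount"] = int(record["UnEmpCount"])
    converted["UnEmpRate"] = float(record["UnEmpRate"])
    return converted


typed_records = [to_numeric(r) for r in sample_records]
print(typed_records[0]["UnEmpCount"], typed_records[0]["UnEmpRate"])
```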

In [6]:
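# Serialize each record to a JSON string, load the strings into an RDD,
# then let Spark parse the RDD into a DataFrame
import json
caUEDataRDD = spark.sparkContext.parallelize([json.dumps(record) for record in response])
caUEDataDF = spark.read.json(caUEDataRDD)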
caUEDataDF = caUEDataDF.withColumn("State", func.lit('CA'))
caUEDataDF.select('AreaID', 'AreaName', 'State', 'AreaType', 'Stfips', 'UnEmpCount', 'UnEmpRate').show(5)

+------+----------------+-----+--------+------+----------+---------+
|AreaID|        AreaName|State|AreaType|Stfips|UnEmpCount|UnEmpRate|
+------+----------------+-----+--------+------+----------+---------+
|000001|  Alameda County|   CA|      04|    06|     52111|      6.5|
|000003|   Alpine County|   CA|      04|    06|        46|      7.6|
|000005|   Amador County|   CA|      04|    06|      1074|      7.4|
|000007|    Butte County|   CA|      04|    06|      6917|      7.5|
|000009|Calaveras County|   CA|      04|    06|      1351|      6.3|
+------+----------------+-----+--------+------+----------+---------+
only showing top 5 rows



In [7]:
# Selecting relevant columns
caUEDataDF.createOrReplaceTempView("caUE")
caUEDataDF = spark.sql("""
        SELECT 
            State as state_id, 
            REPLACE(AreaName, ' County', '') as county_name,
            UnEmpCount as unemployment_count,
            UnEmpRate as unemployment_rate
        FROM caUE
        """)

caUEDataDF.createOrReplaceTempView("caUE")
caUEDataDF.show(5)

+--------+-----------+------------------+-----------------+
|state_id|county_name|unemployment_count|unemployment_rate|
+--------+-----------+------------------+-----------------+
|      CA|    Alameda|             52111|              6.5|
|      CA|     Alpine|                46|              7.6|
|      CA|     Amador|              1074|              7.4|
|      CA|      Butte|              6917|              7.5|
|      CA|  Calaveras|              1351|              6.3|
+--------+-----------+------------------+-----------------+
only showing top 5 rows
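`REPLACE(AreaName, ' County', '')` removes the substring wherever it occurs in the name, not only at the end. The following is a hedged sketch of a suffix-only alternative in plain Python; `strip_county_suffix` is a hypothetical helper, not something the notebook defines.

```python
import re

# Matches " County" only when it ends the string.
_COUNTY_SUFFIX = re.compile(r"\s+County$")


def strip_county_suffix(area_name: str) -> str:
    """Drop a trailing ' County' and surrounding whitespace, nothing else."""
    return _COUNTY_SUFFIX.sub("", area_name.strip())


for name in ["Alameda County", "Alpine County", "Amador County"]:
    print(strip_county_suffix(name))
```

In Spark SQL the same idea can be expressed with `regexp_replace(AreaName, ' County$', '')`.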



### Joining US Cities and US Unemployment Data

In [8]:
spark.sql("""
    SELECT
        DISTINCT
        usc.state_id,
        usc.state_name,
        usc.county_name,
        usue.unemployment_count,
        usue.unemployment_rate
    FROM usCities as usc
    JOIN caUE as usue
    on usc.county_name = usue.county_name
    AND usc.state_id = usue.state_id
    WHERE usc.state_id = 'CA'
""").show(30)

+--------+----------+---------------+------------------+-----------------+
|state_id|state_name|    county_name|unemployment_count|unemployment_rate|
+--------+----------+---------------+------------------+-----------------+
|      CA|California|           Kern|             40925|             11.1|
|      CA|California|  San Francisco|             29475|              5.4|
|      CA|California|       Monterey|             21601|             10.4|
|      CA|California|San Luis Obispo|              7603|              5.8|
|      CA|California|           Mono|               636|              7.8|
|      CA|California|         Madera|              5857|              9.7|
|      CA|California|         Sutter|              4457|             10.1|
|      CA|California|         Fresno|             43577|              9.9|
|      CA|California|       Mariposa|               630|              9.4|
|      CA|California|      Calaveras|              1351|              6.3|
|      CA|California|    
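An inner join drops any county whose name is spelled differently in the two sources, and nothing in the output signals that it happened. One way to check is to compare the key sets on each side. The sketch below uses small illustrative lists; `compare_keys` is a hypothetical helper and `Hypothetical` is a deliberately made-up county name.

```python
def compare_keys(left_keys, right_keys):
    """Report which join keys appear on both sides or on only one side."""
    left, right = set(left_keys), set(right_keys)
    return {
        "matched": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


# Illustrative samples only: a few county names as they might appear in each
# source, including one deliberate mismatch on each side.
cities_counties = ["Kern", "Fresno", "San Francisco", "Hypothetical"]
unemployment_counties = ["Kern", "Fresno", "San Francisco", "Alpine"]

report = compare_keys(cities_counties, unemployment_counties)
print(report["left_only"], report["right_only"])
```

In Spark SQL, a `LEFT ANTI JOIN` on the same keys returns the unmatched rows directly.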

### US Demographics Dataset

In [9]:
# Creating Dataframe and Temp View
usDemoDF = spark.read.option('header', 'True').option('inferSchema', 'true').csv(config['pathways']['usDemographics'])
usDemoDF.createOrReplaceTempView('usDemo')

# Preview all columns
spark.sql("SELECT * FROM usDemo").show()

+----------+--------------+----------+--------------------+------+-------+-------+----+
|state_name|   county_name|population|                race|   sex|min_age|max_age|year|
+----------+--------------+----------+--------------------+------+-------+-------+----+
|   Alabama|Autauga County|      7473|BLACK OR AFRICAN ...|  null|   null|   null|2000|
|   Alabama|Autauga County|         7|SOME OTHER RACE A...|Female|   15.0|   17.0|2000|
|   Alabama|Autauga County|       452|         WHITE ALONE|  Male|   18.0|   19.0|2000|
|   Alabama|Autauga County|         2|         ASIAN ALONE|Female|   20.0|   20.0|2000|
|   Alabama|Autauga County|         9|AMERICAN INDIAN A...|  Male|   35.0|   39.0|2000|
|   Alabama|Autauga County|         0|         ASIAN ALONE|Female|   60.0|   61.0|2000|
|   Alabama|Autauga County|         1|         ASIAN ALONE|  Male|   10.0|   14.0|2000|
|   Alabama|Autauga County|        10|   TWO OR MORE RACES|  Male|   30.0|   34.0|2000|
|   Alabama|Autauga County|     

In [10]:
cleanUSDemo = spark.sql("""
    SELECT 
        state_name, replace(county_name, ' County', '') as county_name,
            CASE 
                WHEN race like 'AMERICAN INDIAN%' then 'American Indian'
                WHEN race like 'SOME OTHER RACE%' then 'Other'
                WHEN race like 'WHITE%' then 'White'
                WHEN race like 'ASIAN%' then 'Asian'
                WHEN race like 'NATIVE HAWAIIAN%' then 'Native Hawaiian'
                WHEN race like 'TWO OR MORE%' then 'Mixed'
                WHEN race like 'BLACK%' then 'African American'
            END as race,
            sex,
            sum(population) as population
    FROM usDemo
    WHERE year = '2010' AND race is NOT NULL AND sex is NOT NULL 
                  AND min_age is NOT NULL AND max_age is NOT NULL
    GROUP BY state_name, county_name, race, sex
    ORDER BY state_name, county_name, race
""")

cleanUSDemo.createOrReplaceTempView('usDemoClean')
cleanUSDemo.show()

+----------+-----------+----------------+------+----------+
|state_name|county_name|            race|   sex|population|
+----------+-----------+----------------+------+----------+
|   Alabama|    Autauga| American Indian|Female|       122|
|   Alabama|    Autauga| American Indian|  Male|       106|
|   Alabama|    Autauga|           Asian|Female|       262|
|   Alabama|    Autauga|           Asian|  Male|       184|
|   Alabama|    Autauga|African American|Female|      4748|
|   Alabama|    Autauga|African American|  Male|      4151|
|   Alabama|    Autauga| Native Hawaiian|Female|        12|
|   Alabama|    Autauga| Native Hawaiian|  Male|        20|
|   Alabama|    Autauga|           Other|  Male|       212|
|   Alabama|    Autauga|           Other|Female|       207|
|   Alabama|    Autauga|           Mixed|  Male|       322|
|   Alabama|    Autauga|           Mixed|Female|       394|
|   Alabama|    Autauga|           White|Female|     20152|
|   Alabama|    Autauga|           White

In [11]:
spark.sql("SELECT * from usDemoClean").show(5)

+----------+-----------+----------------+------+----------+
|state_name|county_name|            race|   sex|population|
+----------+-----------+----------------+------+----------+
|   Alabama|    Autauga| American Indian|  Male|       106|
|   Alabama|    Autauga| American Indian|Female|       122|
|   Alabama|    Autauga|           Asian|Female|       262|
|   Alabama|    Autauga|           Asian|  Male|       184|
|   Alabama|    Autauga|African American|Female|      4748|
+----------+-----------+----------------+------+----------+
only showing top 5 rows
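The `CASE` expression in the cleaning query maps raw race strings to short labels by prefix. It has no `ELSE` branch, so any value that matches none of the patterns becomes `NULL`. The following is a minimal plain-Python sketch of the same mapping, useful for checking labels outside Spark; `label_race` is a hypothetical helper name.

```python
# Prefix-to-label pairs copied from the CASE expression in the cleaning query.
RACE_LABELS = [
    ("AMERICAN INDIAN", "American Indian"),
    ("SOME OTHER RACE", "Other"),
    ("WHITE", "White"),
    ("ASIAN", "Asian"),
    ("NATIVE HAWAIIAN", "Native Hawaiian"),
    ("TWO OR MORE", "Mixed"),
    ("BLACK", "African American"),
]


def label_race(raw):
    """Return the short label for a raw race string, or None if nothing matches."""
    if raw is None:
        return None
    for prefix, label in RACE_LABELS:
        if raw.startswith(prefix):
            return label
    return None  # mirrors a CASE with no ELSE branch, which yields NULL


print(label_race("WHITE ALONE"), label_race("TWO OR MORE RACES"))
```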



### Joining US Unemployment Data and US Population

In [49]:
spark.sql("""
    WITH
        demo as (
                SELECT
                    usDC.state_name,
                    usDC.county_name,
                    usDC.race,
                    sum(usDC.population) as population
                FROM usDemoClean AS usDC
                WHERE usDC.state_name = 'California'
                GROUP BY usDC.state_name, usDC.county_name, usDC.race
        )
    SELECT
        demo.state_name,
        demo.county_name,
        demo.race,
        demo.population,
        usue.unemployment_count,
        usue.unemployment_rate
    FROM demo
    JOIN caUE as usue
    on demo.county_name = usue.county_name
""").show(10)

+----------+-----------+----------------+----------+------------------+-----------------+
|state_name|county_name|            race|population|unemployment_count|unemployment_rate|
+----------+-----------+----------------+----------+------------------+-----------------+
|California|     Plumas|African American|       179|               867|               12|
|California|     Plumas| American Indian|       515|               867|               12|
|California|     Plumas|           Asian|       126|               867|               12|
|California|     Plumas|           Mixed|       636|               867|               12|
|California|     Plumas| Native Hawaiian|        18|               867|               12|
|California|     Plumas|           Other|       556|               867|               12|
|California|     Plumas|           White|     16670|               867|               12|
|California|      Kings|African American|     10372|              5816|             10.5|
|Californi
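In the joined result the unemployment figures are county-level values repeated on every race row, so summing `unemployment_count` across rows would count each county once per race. The sketch below illustrates this with the Plumas rows shown above and also computes each race's share of the county population. It is a plain-Python illustration under those assumptions, not a step from the original notebook.

```python
# Plumas County rows copied from the joined output above: (race, population).
plumas = [
    ("African American", 179),
    ("American Indian", 515),
    ("Asian", 126),
    ("Mixed", 636),
    ("Native Hawaiian", 18),
    ("Other", 556),
    ("White", 16670),
]
county_unemployment_count = 867  # one value per county, repeated on each row

total_population = sum(pop for _, pop in plumas)
shares = {race: pop / total_population for race, pop in plumas}

# Summing the repeated county value over the rows counts it once per race.
naive_sum = county_unemployment_count * len(plumas)

print(total_population)
print(naive_sum)
```

Keeping one unemployment row per county, or aggregating with `MAX` rather than `SUM`, avoids the overcount.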