# Data Exploration of US Demographics Data

**[Synopsis] This notebook explores the US Demographics dataset used for the Police Shooting Dashboard.**

**Reference**:
* [US Demographics Data](https://api.careeronestop.org/api-explorer/home/index/UnEmployment_GetUnEmploymentType)

*****

In [1]:
import os
import requests
import configparser
config = configparser.ConfigParser()
config.read(os.path.join(os.path.dirname(os.getcwd()), 'config.ini'))

['config.ini']
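
`config.read` returns the list of files it parsed successfully, so the `['config.ini']` output confirms the file was found. Later cells read the CSV location from `config['pathways']['usDemographics']`. As a rough sketch of what that lookup expects, the snippet below parses a hypothetical `config.ini` body. Only the section and key names come from this notebook; the path value and the `load_demographics_path` helper are made up for illustration.

```python
import configparser

# Hypothetical config.ini contents. Only the section name 'pathways' and the
# key 'usDemographics' come from the notebook; the path is an invented example.
SAMPLE_INI = """
[pathways]
usDemographics = /data/us_demographics.csv
"""


def load_demographics_path(ini_text):
    """Return the demographics CSV path from config text, failing loudly if absent."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    # ConfigParser lower-cases option names internally, so the mixed-case
    # key 'usDemographics' is still matched here.
    if not config.has_option('pathways', 'usDemographics'):
        raise KeyError("config.ini is missing [pathways] usDemographics")
    return config['pathways']['usDemographics']


demographics_path = load_demographics_path(SAMPLE_INI)
print(demographics_path)
```

Checking for the key up front gives a clearer error than letting `spark.read.csv` fail on a missing path.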

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, LongType

spark = SparkSession.builder.master('local[*]').appName('DataExploration').getOrCreate()

### Data Schema

In [3]:
usDemoDF = spark.read.option('header', 'true').option('inferSchema', 'true').csv(config['pathways']['usDemographics'])
usDemoDF.createOrReplaceTempView('usDemo')
usDemoDF.printSchema()

root
 |-- state_name: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- population: integer (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- min_age: double (nullable = true)
 |-- max_age: double (nullable = true)
 |-- year: integer (nullable = true)



### Preview of the Data

In [3]:
# Unfiltered preview
spark.sql('SELECT * FROM usDemo').show()

+----------+--------------+----------+--------------------+------+-------+-------+----+
|state_name|   county_name|population|                race|   sex|min_age|max_age|year|
+----------+--------------+----------+--------------------+------+-------+-------+----+
|   Alabama|Autauga County|      7473|BLACK OR AFRICAN ...|  null|   null|   null|2000|
|   Alabama|Autauga County|         7|SOME OTHER RACE A...|Female|   15.0|   17.0|2000|
|   Alabama|Autauga County|       452|         WHITE ALONE|  Male|   18.0|   19.0|2000|
|   Alabama|Autauga County|         2|         ASIAN ALONE|Female|   20.0|   20.0|2000|
|   Alabama|Autauga County|         9|AMERICAN INDIAN A...|  Male|   35.0|   39.0|2000|
|   Alabama|Autauga County|         0|         ASIAN ALONE|Female|   60.0|   61.0|2000|
|   Alabama|Autauga County|         1|         ASIAN ALONE|  Male|   10.0|   14.0|2000|
|   Alabama|Autauga County|        10|   TWO OR MORE RACES|  Male|   30.0|   34.0|2000|
|   Alabama|Autauga County|     

### Total Count

In [4]:
spark.sql('SELECT count(*) as total_count FROM usDemo').show()

+-----------+
|total_count|
+-----------+
|    3664512|
+-----------+



### Total Count for Year 2010

In [6]:
# year is inferred as an integer column, so compare against an integer literal
spark.sql("SELECT count(*) as total_count FROM usDemo WHERE year = 2010").show()

+-----------+
|total_count|
+-----------+
|    1855296|
+-----------+



### Total Count for Year 2000

In [7]:
# year is inferred as an integer column, so compare against an integer literal
spark.sql("SELECT count(*) as total_count FROM usDemo WHERE year = 2000").show()

+-----------+
|total_count|
+-----------+
|    1809216|
+-----------+
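
A quick arithmetic check on the three counts above shows whether the dataset holds any rows outside the two census years. This is only a sanity check on the printed numbers, not an additional query against the data.

```python
# Row counts copied from the query outputs above.
total_count = 3664512
count_2010 = 1855296
count_2000 = 1809216

# Rows not accounted for by either year (other years or a null year).
unaccounted = total_count - (count_2010 + count_2000)
print(unaccounted)  # 0
```

Since nothing is left over, every row belongs to either 2000 or 2010.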



### Total Population by State and County (2010)

In [4]:
# Total population by state and county
# Note: Rows with a null in sex, race, min_age, or max_age are roll-up rows:
#       a null means "across all values" of that column (all sexes, races, or ages).
#       Filter them out before summing; otherwise the population is counted more than once.
spark.sql("""
        SELECT
            RANK() OVER (   
                ORDER BY total_population DESC) AS rank, 
                *
        FROM
            (SELECT 
                state_name, county_name, sum(population) AS total_population 
            FROM usDemo
            WHERE year = 2010 AND race IS NOT NULL AND sex IS NOT NULL 
                  AND min_age IS NOT NULL AND max_age IS NOT NULL
            GROUP BY state_name, county_name) as a
""").show()

+----+----------+--------------------+----------------+
|rank|state_name|         county_name|total_population|
+----+----------+--------------------+----------------+
|   1|California|  Los Angeles County|         9021186|
|   2|  Illinois|         Cook County|         4760805|
|   3|     Texas|       Harris County|         3719457|
|   4|   Arizona|     Maricopa County|         3475293|
|   5|California|    San Diego County|         2837930|
|   6|California|       Orange County|         2769021|
|   7|   Florida|   Miami-Dade County|         2300632|
|   8|  New York|        Kings County|         2286552|
|   9|     Texas|       Dallas County|         2149494|
|  10|  New York|       Queens County|         2056083|
|  11|California|    Riverside County|         1995011|
|  12|California|San Bernardino Co...|         1855546|
|  13|    Nevada|        Clark County|         1792603|
|  14|Washington|         King County|         1777171|
|  15|  Michigan|        Wayne County|         1
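
The note in the query above says that rows with nulls are roll-ups and inflate the totals if they are summed together with the detail rows. The sketch below reproduces that effect on a tiny invented table. It uses Python's built-in `sqlite3` instead of Spark so that it runs anywhere; the table name `demo`, the county name, and all numbers are made up, and only the filter condition mirrors the notebook.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE demo (
        county_name TEXT, race TEXT, sex TEXT,
        min_age REAL, max_age REAL, population INTEGER
    )
""")

# Invented toy data: two detail rows plus two roll-up rows that repeat them.
rows = [
    ('Example County', 'WHITE ALONE', 'Male',   18.0, 19.0, 10),  # detail
    ('Example County', 'WHITE ALONE', 'Female', 18.0, 19.0, 12),  # detail
    ('Example County', 'WHITE ALONE', None,     None, None, 22),  # all sexes and ages
    ('Example County', None,          None,     None, None, 22),  # all races, sexes, ages
]
conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?, ?, ?)", rows)

# Summing every row counts each person once per roll-up level.
naive_total = conn.execute("SELECT SUM(population) FROM demo").fetchone()[0]

# Keeping only fully specified rows counts each person once.
filtered_total = conn.execute("""
    SELECT SUM(population) FROM demo
    WHERE race IS NOT NULL AND sex IS NOT NULL
          AND min_age IS NOT NULL AND max_age IS NOT NULL
""").fetchone()[0]

print(naive_total, filtered_total)  # 66 22
conn.close()
```

In this toy case the unfiltered sum is three times the real population, which is why the notebook filters on all four columns.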

### Unique Race Values

In [5]:
# Distinct race values after normalization
# (the null in the output comes from rows where race is null)
spark.sql("""
    SELECT DISTINCT race
    FROM (    
        SELECT
            CASE 
                WHEN race like 'AMERICAN INDIAN%' then 'American Indian'
                WHEN race like 'SOME OTHER RACE%' then 'Other'
                WHEN race like 'WHITE%' then 'White'
                WHEN race like 'ASIAN%' then 'Asian'
                WHEN race like 'NATIVE HAWAIIAN%' then 'Native Hawaiian'
                WHEN race like 'TWO OR MORE%' then 'Mixed'
                WHEN race like 'BLACK%' then 'African American'
            END as race
        FROM usDemo) as a
""").show()

+----------------+
|            race|
+----------------+
| Native Hawaiian|
|            null|
|African American|
|           Other|
| American Indian|
|           Mixed|
|           White|
|           Asian|
+----------------+
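
The `CASE` expression above is a prefix match from the raw census labels to short names. For readers who want to test the mapping without a Spark session, here is a minimal pure-Python sketch of the same rules. The names `RACE_PREFIXES` and `normalize_race` are illustrative and not part of the notebook.

```python
# Prefix -> normalized label, in the same order as the CASE expression.
RACE_PREFIXES = [
    ('AMERICAN INDIAN', 'American Indian'),
    ('SOME OTHER RACE', 'Other'),
    ('WHITE', 'White'),
    ('ASIAN', 'Asian'),
    ('NATIVE HAWAIIAN', 'Native Hawaiian'),
    ('TWO OR MORE', 'Mixed'),
    ('BLACK', 'African American'),
]


def normalize_race(raw):
    """Map a raw race label to its short form; None stays None."""
    if raw is None:
        return None
    for prefix, label in RACE_PREFIXES:
        # Case-sensitive prefix match, like `race LIKE 'PREFIX%'`.
        if raw.startswith(prefix):
            return label
    # The CASE has no ELSE branch, so unmatched values become null as well.
    return None


print(normalize_race('WHITE ALONE'))        # White
print(normalize_race('TWO OR MORE RACES'))  # Mixed
print(normalize_race(None))                 # None
```

Because unmatched labels also map to `None`, an unexpected raw value would be indistinguishable from a roll-up row; adding an explicit fallback label would make such cases visible.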



### Population by State, County, and Race

In [7]:
# Normalize race values and select the relevant columns
# Note: Rows with a null in sex, race, min_age, or max_age are roll-up rows:
#       a null means "across all values" of that column (all sexes, races, or ages).
#       Filter them out before summing; otherwise the population is counted more than once.
usDemoNorm = spark.sql("""   
                        SELECT
                            state_name,
                            county_name,
                            sex,
                            min_age,
                            max_age,
                            year,
                            CASE 
                                WHEN race like 'AMERICAN INDIAN%' then 'American Indian'
                                WHEN race like 'SOME OTHER RACE%' then 'Other'
                                WHEN race like 'WHITE%' then 'White'
                                WHEN race like 'ASIAN%' then 'Asian'
                                WHEN race like 'NATIVE HAWAIIAN%' then 'Native Hawaiian'
                                WHEN race like 'TWO OR MORE%' then 'Mixed'
                                WHEN race like 'BLACK%' then 'African American'
                            END as race,
                            population
                        FROM usDemo
                        WHERE year = 2010 AND race IS NOT NULL AND sex IS NOT NULL 
                              AND min_age IS NOT NULL AND max_age IS NOT NULL
                        """)
usDemoNorm.createOrReplaceTempView('usDemoNorm')

In [8]:
# Population by state, county, sex and race
spark.sql("""
            SELECT 
                state_name, county_name, sex, race, sum(population) AS population 
            FROM usDemoNorm
            GROUP BY state_name, county_name, sex, race
            ORDER BY state_name, county_name, sex, population DESC
""").show()

+----------+--------------+------+----------------+----------+
|state_name|   county_name|   sex|            race|population|
+----------+--------------+------+----------------+----------+
|   Alabama|Autauga County|Female|           White|     20152|
|   Alabama|Autauga County|Female|African American|      4748|
|   Alabama|Autauga County|Female|           Mixed|       394|
|   Alabama|Autauga County|Female|           Asian|       262|
|   Alabama|Autauga County|Female|           Other|       207|
|   Alabama|Autauga County|Female| American Indian|       122|
|   Alabama|Autauga County|Female| Native Hawaiian|        12|
|   Alabama|Autauga County|  Male|           White|     19549|
|   Alabama|Autauga County|  Male|African American|      4151|
|   Alabama|Autauga County|  Male|           Mixed|       322|
|   Alabama|Autauga County|  Male|           Other|       212|
|   Alabama|Autauga County|  Male|           Asian|       184|
|   Alabama|Autauga County|  Male| American Indian|    

In [10]:
# Total population by state and county
spark.sql("""
        SELECT
            RANK() OVER (   
                ORDER BY population DESC) AS rank, 
                *
        FROM
            (SELECT 
                state_name, county_name, sum(population) AS population 
            FROM usDemoNorm
            GROUP BY state_name, county_name) as a
""").show()

+----+----------+--------------------+----------+
|rank|state_name|         county_name|population|
+----+----------+--------------------+----------+
|   1|California|  Los Angeles County|   9021186|
|   2|  Illinois|         Cook County|   4760805|
|   3|     Texas|       Harris County|   3719457|
|   4|   Arizona|     Maricopa County|   3475293|
|   5|California|    San Diego County|   2837930|
|   6|California|       Orange County|   2769021|
|   7|   Florida|   Miami-Dade County|   2300632|
|   8|  New York|        Kings County|   2286552|
|   9|     Texas|       Dallas County|   2149494|
|  10|  New York|       Queens County|   2056083|
|  11|California|    Riverside County|   1995011|
|  12|California|San Bernardino Co...|   1855546|
|  13|    Nevada|        Clark County|   1792603|
|  14|Washington|         King County|   1777171|
|  15|  Michigan|        Wayne County|   1667815|
|  16|     Texas|      Tarrant County|   1646668|
|  17|California|  Santa Clara County|   1629703|


### California, Los Angeles County Subset

In [12]:
# Group by state, county, sex and race
spark.sql(""" 
    SELECT 
        state_name, county_name, sex, race, sum(population) AS population 
    FROM usDemoNorm
    WHERE state_name = 'California' and county_name = 'Los Angeles County'
    GROUP BY state_name, county_name, sex, race
    ORDER BY population DESC, sex
""").show(25)

+----------+------------------+------+----------------+----------+
|state_name|       county_name|   sex|            race|population|
+----------+------------------+------+----------------+----------+
|California|Los Angeles County|  Male|           White|   2277232|
|California|Los Angeles County|Female|           White|   2265836|
|California|Los Angeles County|  Male|           Other|    979288|
|California|Los Angeles County|Female|           Other|    963015|
|California|Los Angeles County|Female|           Asian|    674197|
|California|Los Angeles County|  Male|           Asian|    591477|
|California|Los Angeles County|Female|African American|    420898|
|California|Los Angeles County|  Male|African American|    372768|
|California|Los Angeles County|Female|           Mixed|    195525|
|California|Los Angeles County|  Male|           Mixed|    189405|
|California|Los Angeles County|  Male| American Indian|     34586|
|California|Los Angeles County|Female| American Indian|     32

### New York, Queens County Subset

In [17]:
# Group by state, county, sex and race
spark.sql(""" 
    SELECT 
        state_name, county_name, sex, race, sum(population) AS population 
    FROM usDemoNorm
    WHERE state_name = 'New York' and county_name = 'Queens County'
    GROUP BY state_name, county_name, sex, race
    ORDER BY population DESC, sex
""").show(25)

+----------+-------------+------+----------------+----------+
|state_name|  county_name|   sex|            race|population|
+----------+-------------+------+----------------+----------+
|  New York|Queens County|Female|           White|    410642|
|  New York|Queens County|  Male|           White|    398756|
|  New York|Queens County|Female|           Asian|    246113|
|  New York|Queens County|  Male|           Asian|    233225|
|  New York|Queens County|Female|African American|    218094|
|  New York|Queens County|  Male|African American|    177294|
|  New York|Queens County|  Male|           Other|    136548|
|  New York|Queens County|Female|           Other|    128769|
|  New York|Queens County|Female|           Mixed|     46749|
|  New York|Queens County|  Male|           Mixed|     44720|
|  New York|Queens County|  Male| American Indian|      7259|
|  New York|Queens County|Female| American Indian|      6529|
|  New York|Queens County|Female| Native Hawaiian|       699|
|  New Y