# Data Exploration for Police Shooting Data

**[Synopsis] This notebook explores the Police Shootings dataset used for the Police Shooting Dashboard.**

**Note:** This dataset is updated frequently, so a weekly download can be scheduled to keep the local copy current.

**Reference**:
* [Police Shootings Data](https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv)
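
The weekly refresh mentioned in the note could be scripted roughly as follows. This is a minimal sketch using only the standard library; the function name `download_dataset` and the `.part` temporary-file convention are assumptions made for illustration and do not appear in the notebook.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the Reference list above.
DATA_URL = (
    "https://raw.githubusercontent.com/washingtonpost/"
    "data-police-shootings/master/fatal-police-shootings-data.csv"
)


def download_dataset(url: str, dest: str) -> Path:
    """Download `url` to `dest`, replacing the old file only after a full download."""
    dest_path = Path(dest)
    # Hypothetical convention: write to "<name>.part" first, then rename.
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    with urllib.request.urlopen(url, timeout=60) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp_path.replace(dest_path)
    return dest_path
```

A scheduler such as cron could call `download_dataset(DATA_URL, config['pathways']['policeShootings'])` once a week. Because the data is written to a temporary file first, a failed download leaves the previous copy in place.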

*****

In [5]:
import requests
import configparser
config = configparser.ConfigParser()
config.read('config.ini')

['config.ini']

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType, BooleanType, FloatType

spark = SparkSession.builder.master('local[*]').appName('DataExploration').getOrCreate()

## Police Shooting Data

In [7]:
# Schema of the Police Shootings Dataset
psSchema = StructType([
                       StructField('id', IntegerType(), False),
                       StructField('name', StringType(), True),
                       StructField('date', DateType(), True),
                       StructField('manner_of_death', StringType(), True),
                       StructField('armed', StringType(), True),
                       StructField('age', IntegerType(), True),
                       StructField('gender', StringType(), True),
                       StructField('race', StringType(), True),
                       StructField('city', StringType(), True),
                       StructField('state', StringType(), True),
                       StructField('s_o_m_i', BooleanType(), True),
                       StructField('threat_level', StringType(), True),
                       StructField('flee', StringType(), True),
                       StructField('body_camera', BooleanType(), True),
                       StructField('longitude', FloatType(), True),
                       StructField('latitude', FloatType(), True),
                       StructField('is_geocoding_exact', BooleanType(), True)
                        ])

In [8]:
# Creating Dataframe and Temp View
psDF = spark.read.option('header', 'True').schema(psSchema).csv(config['pathways']['policeShootings'])
psDF.createOrReplaceTempView('policeShootings')
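
Since the schema is supplied explicitly, it may be worth checking the file's header against the schema's field names before loading: to my understanding, Spark's CSV reader by default applies a user-supplied schema by position rather than by header name, so a reordered column upstream could go unnoticed. The sketch below uses plain Python; `header_mismatches` is a hypothetical helper, and the expected names are copied from `psSchema`. The notebook does not show the raw header, so a reported mismatch may simply be a deliberate rename in the schema.

```python
import csv
import io

# Field names copied from psSchema, in order.
EXPECTED_COLUMNS = [
    "id", "name", "date", "manner_of_death", "armed", "age", "gender",
    "race", "city", "state", "s_o_m_i", "threat_level", "flee",
    "body_camera", "longitude", "latitude", "is_geocoding_exact",
]


def header_mismatches(csv_text, expected=EXPECTED_COLUMNS):
    """Return (position, expected_name, actual_name) for each differing column."""
    # Only the first row of the CSV text (the header) is inspected.
    actual = next(csv.reader(io.StringIO(csv_text)), [])
    length = max(len(expected), len(actual))
    # Pad the shorter list with None so missing or extra columns are reported too.
    padded_expected = list(expected) + [None] * (length - len(expected))
    padded_actual = actual + [None] * (length - len(actual))
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(padded_expected, padded_actual))
        if e != a
    ]
```

An empty list means the header agrees with the schema name for name; anything else lists the positions to inspect before trusting the typed columns.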

### Preview of the Data

In [9]:
# First half of the columns
psDF.select('id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race', 'city', 'state').show(5)

+---+------------------+----------+----------------+----------+---+------+----+-------------+-----+
| id|              name|      date| manner_of_death|     armed|age|gender|race|         city|state|
+---+------------------+----------+----------------+----------+---+------+----+-------------+-----+
|  3|        Tim Elliot|2015-01-02|            shot|       gun| 53|     M|   A|      Shelton|   WA|
|  4|  Lewis Lee Lembke|2015-01-02|            shot|       gun| 47|     M|   W|        Aloha|   OR|
|  5|John Paul Quintero|2015-01-03|shot and Tasered|   unarmed| 23|     M|   H|      Wichita|   KS|
|  8|   Matthew Hoffman|2015-01-04|            shot|toy weapon| 32|     M|   W|San Francisco|   CA|
|  9| Michael Rodriguez|2015-01-04|            shot|  nail gun| 39|     M|   H|        Evans|   CO|
+---+------------------+----------+----------------+----------+---+------+----+-------------+-----+
only showing top 5 rows



In [10]:
# Second half of the columns
psDF.select('s_o_m_i', 'threat_level', 'flee', 'body_camera', 
            'longitude', 'latitude', 'is_geocoding_exact').show(5)

+-------+------------+-----------+-----------+---------+--------+------------------+
|s_o_m_i|threat_level|       flee|body_camera|longitude|latitude|is_geocoding_exact|
+-------+------------+-----------+-----------+---------+--------+------------------+
|   true|      attack|Not fleeing|      false| -123.122|  47.247|              true|
|  false|      attack|Not fleeing|      false| -122.892|  45.487|              true|
|  false|       other|Not fleeing|      false|  -97.281|  37.695|              true|
|   true|      attack|Not fleeing|      false| -122.422|  37.763|              true|
|  false|      attack|Not fleeing|      false| -104.692|  40.384|              true|
+-------+------------+-----------+-----------+---------+--------+------------------+
only showing top 5 rows



### Relevant Columns for Joining to Other Datasets

In [7]:
# Select relevant columns
psDF.select('id', 'name', 'date', 'city', 'state').show(5)

+---+------------------+----------+-------------+-----+
| id|              name|      date|         city|state|
+---+------------------+----------+-------------+-----+
|  3|        Tim Elliot|2015-01-02|      Shelton|   WA|
|  4|  Lewis Lee Lembke|2015-01-02|        Aloha|   OR|
|  5|John Paul Quintero|2015-01-03|      Wichita|   KS|
|  8|   Matthew Hoffman|2015-01-04|San Francisco|   CA|
|  9| Michael Rodriguez|2015-01-04|        Evans|   CO|
+---+------------------+----------+-------------+-----+
only showing top 5 rows



### Relevant Columns to the Shooting

In [8]:
# Select relevant columns
psDF.select('name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race', 's_o_m_i', 'threat_level', 'flee', 'body_camera').show(20)

+--------------------+----------+----------------+----------+---+------+----+-------+------------+-----------+-----------+
|                name|      date| manner_of_death|     armed|age|gender|race|s_o_m_i|threat_level|       flee|body_camera|
+--------------------+----------+----------------+----------+---+------+----+-------+------------+-----------+-----------+
|          Tim Elliot|2015-01-02|            shot|       gun| 53|     M|   A|   true|      attack|Not fleeing|      false|
|    Lewis Lee Lembke|2015-01-02|            shot|       gun| 47|     M|   W|  false|      attack|Not fleeing|      false|
|  John Paul Quintero|2015-01-03|shot and Tasered|   unarmed| 23|     M|   H|  false|       other|Not fleeing|      false|
|     Matthew Hoffman|2015-01-04|            shot|toy weapon| 32|     M|   W|   true|      attack|Not fleeing|      false|
|   Michael Rodriguez|2015-01-04|            shot|  nail gun| 39|     M|   H|  false|      attack|Not fleeing|      false|
|   Kenneth Joe 

### Context of the Shooting

In [9]:
# Looking at manner of death and situation details
spark.sql("SELECT manner_of_death, armed, race, s_o_m_i, threat_level, flee, body_camera FROM policeShootings").show(20)

+----------------+----------+----+-------+------------+-----------+-----------+
| manner_of_death|     armed|race|s_o_m_i|threat_level|       flee|body_camera|
+----------------+----------+----+-------+------------+-----------+-----------+
|            shot|       gun|   A|   true|      attack|Not fleeing|      false|
|            shot|       gun|   W|  false|      attack|Not fleeing|      false|
|shot and Tasered|   unarmed|   H|  false|       other|Not fleeing|      false|
|            shot|toy weapon|   W|   true|      attack|Not fleeing|      false|
|            shot|  nail gun|   H|  false|      attack|Not fleeing|      false|
|            shot|       gun|   W|  false|      attack|Not fleeing|      false|
|            shot|       gun|   H|  false|      attack|        Car|      false|
|            shot|       gun|   W|  false|      attack|Not fleeing|      false|
|            shot|   unarmed|   W|  false|       other|Not fleeing|       true|
|            shot|toy weapon|   B|  fals

### Unique Values of Manner of Death & Count

In [10]:
# Manner of death
spark.sql("SELECT DISTINCT manner_of_death FROM policeShootings").show()

+----------------+
| manner_of_death|
+----------------+
|shot and Tasered|
|            shot|
+----------------+



In [11]:
# Number of deaths by manner of death
spark.sql("SELECT manner_of_death, count(*) as count FROM policeShootings GROUP BY manner_of_death ORDER BY count DESC").show()

+----------------+-----+
| manner_of_death|count|
+----------------+-----+
|            shot| 6202|
|shot and Tasered|  331|
+----------------+-----+
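
To put the two counts in proportion, the shares can be worked out directly from the output above. This is a small sketch in plain Python with the counts copied by hand; the variable names are illustrative only.

```python
# Counts copied from the GROUP BY output above.
counts = {"shot": 6202, "shot and Tasered": 331}

total = sum(counts.values())
# Percentage of all records per manner of death, rounded to one decimal place.
shares = {manner: round(100 * n / total, 1) for manner, n in counts.items()}

for manner, pct in shares.items():
    print(f"{manner}: {pct}%")
# shot: 94.9%
# shot and Tasered: 5.1%
```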



### Unique Values of Armed & Count

In [12]:
# Armed
spark.sql("SELECT DISTINCT armed FROM policeShootings").show()

+-------------------+
|              armed|
+-------------------+
|         metal pole|
|         motorcycle|
|           crossbow|
|                pen|
|           nail gun|
|  incendiary device|
| contractor's level|
|     Airsoft pistol|
|              knife|
|    hatchet and gun|
|            stapler|
|guns and explosives|
|       bean-bag gun|
|         microphone|
|            unarmed|
|          tire iron|
|        garden tool|
|               null|
|             wrench|
|  knife and vehicle|
+-------------------+
only showing top 20 rows



In [13]:
# Number of deaths by weapon
spark.sql("SELECT armed, count(*) as count FROM policeShootings GROUP BY armed ORDER BY count DESC").show()

+---------------+-----+
|          armed|count|
+---------------+-----+
|            gun| 3761|
|          knife|  948|
|        unarmed|  419|
|     toy weapon|  223|
|        vehicle|  212|
|           null|  207|
|   undetermined|  182|
| unknown weapon|   82|
|        machete|   50|
|          Taser|   33|
|             ax|   24|
|          sword|   23|
|  gun and knife|   22|
|   baseball bat|   20|
|         hammer|   18|
|gun and vehicle|   17|
|    screwdriver|   16|
|     metal pipe|   16|
|        hatchet|   14|
|         BB gun|   14|
+---------------+-----+
only showing top 20 rows



### Unique Values of Flee & Count

In [14]:
# Flee
spark.sql("SELECT DISTINCT flee FROM policeShootings").show()

+-----------+
|       flee|
+-----------+
|Not fleeing|
|       null|
|       Foot|
|        Car|
|      Other|
+-----------+



In [15]:
# Number of deaths by method of fleeing
spark.sql("SELECT flee, count(*) as count FROM policeShootings GROUP BY flee ORDER BY count DESC").show()

+-----------+-----+
|       flee|count|
+-----------+-----+
|Not fleeing| 3929|
|        Car| 1051|
|       Foot|  839|
|       null|  469|
|      Other|  245|
+-----------+-----+
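
A quick sanity check on these groupings is that each one should account for every record, so the totals must agree. The sketch below, in plain Python with counts copied from the outputs above, confirms this and also computes how much of the `flee` column is null; the names are illustrative only.

```python
# Counts copied from the two GROUP BY outputs above; None stands for SQL NULL.
manner_counts = {"shot": 6202, "shot and Tasered": 331}
flee_counts = {"Not fleeing": 3929, "Car": 1051, "Foot": 839, None: 469, "Other": 245}

totals = {
    "manner_of_death": sum(manner_counts.values()),
    "flee": sum(flee_counts.values()),
}
# Every grouping partitions the same rows, so all totals should be equal.
totals_agree = len(set(totals.values())) == 1

# Share of records with no flee status recorded, in percent.
null_share = round(100 * flee_counts[None] / totals["flee"], 1)

print(totals, totals_agree, null_share)
```

If the totals ever diverge after a refresh, the counts were probably taken from different snapshots of the file.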



### Unique Values of Race & Count

In [16]:
# Race
spark.sql("""
            SELECT 
                CASE
                    WHEN race = 'A' THEN 'Asian'
                    WHEN race = 'B' THEN 'Black'
                    WHEN race = 'N' THEN 'Native'
                    WHEN race = 'H' THEN 'Hispanic'
                    WHEN race = 'W' THEN 'White'
                    WHEN race = 'O' THEN 'Other'
                    ELSE 'Not Documented'
                END as race
                FROM    
                  (SELECT DISTINCT race FROM policeShootings) as ps
        """).show()

+--------------+
|          race|
+--------------+
|Not Documented|
|         Black|
|         Other|
|         Asian|
|        Native|
|         White|
|      Hispanic|
+--------------+



In [21]:
# Number of deaths by race
policeShootingsNorm = spark.sql("""
SELECT race, count(*) as count FROM 
            (SELECT 
                CASE
                    WHEN race = 'A' THEN 'Asian'
                    WHEN race = 'B' THEN 'Black'
                    WHEN race = 'N' THEN 'Native'
                    WHEN race = 'H' THEN 'Hispanic'
                    WHEN race = 'W' THEN 'White'
                    WHEN race = 'O' THEN 'Other'
                    ELSE 'Not Documented'
                END as race
                FROM    
                  policeShootings)
                GROUP BY race ORDER BY count DESC

""")
policeShootingsNorm.show()

+--------------+-----+
|          race|count|
+--------------+-----+
|         White| 2962|
|         Black| 1551|
|      Hispanic| 1081|
|Not Documented|  695|
|         Asian|  106|
|        Native|   91|
|         Other|   47|
+--------------+-----+



In [25]:
# Temp view with readable race labels, used by the queries below
policeShootingsNorm = spark.sql("""
        SELECT 
            name,
            manner_of_death,
            armed,
            s_o_m_i,
            threat_level,
            flee,
            body_camera,
            CASE
                WHEN race = 'A' THEN 'Asian'
                WHEN race = 'B' THEN 'Black'
                WHEN race = 'N' THEN 'Native'
                WHEN race = 'H' THEN 'Hispanic'
                WHEN race = 'W' THEN 'White'
                WHEN race = 'O' THEN 'Other'
                ELSE 'Not Documented'
            END as race
        FROM    
          policeShootings

""")
policeShootingsNorm.createOrReplaceTempView('policeShootingsNorm')
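
The same `CASE` expression is written out three times above. One way to keep the label mapping in a single place is to generate the SQL fragment from a dictionary. This is a sketch only: `RACE_LABELS` and `race_case_sql` are hypothetical names, and the codes and labels are copied from the queries above.

```python
# Code-to-label mapping copied from the CASE expressions above.
RACE_LABELS = {
    "A": "Asian",
    "B": "Black",
    "N": "Native",
    "H": "Hispanic",
    "W": "White",
    "O": "Other",
}


def race_case_sql(column="race", labels=RACE_LABELS, default="Not Documented"):
    """Build the CASE expression that maps race codes to readable labels."""
    whens = "\n".join(
        f"    WHEN {column} = '{code}' THEN '{label}'"
        for code, label in labels.items()
    )
    return f"CASE\n{whens}\n    ELSE '{default}'\nEND"


print(race_case_sql())
```

The fragment could then be interpolated into a query, for example `spark.sql(f"SELECT name, {race_case_sql()} AS race FROM policeShootings")`. String interpolation is acceptable here only because the codes and labels are fixed constants, not user input.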

### Exploring Records where NULLs Exist

In [27]:
# Shootings where race of victim isn't documented
spark.sql("""
    SELECT name, manner_of_death, armed, race, s_o_m_i, threat_level, flee, body_camera 
    FROM policeShootingsNorm 
    WHERE race = 'Not Documented'
""").show()

+--------------------+----------------+--------------+--------------+-------+------------+-----------+-----------+
|                name| manner_of_death|         armed|          race|s_o_m_i|threat_level|       flee|body_camera|
+--------------------+----------------+--------------+--------------+-------+------------+-----------+-----------+
|    William Campbell|            shot|           gun|Not Documented|  false|      attack|Not fleeing|      false|
|  John Marcell Allen|            shot|           gun|Not Documented|  false|      attack|Not fleeing|      false|
|          Mark Smith|shot and Tasered|          null|Not Documented|  false|      attack|      Other|      false|
|          Joseph Roy|            shot|         knife|Not Documented|   true|       other|Not fleeing|      false|
|James Anthony Morris|            shot|           gun|Not Documented|   true|      attack|Not fleeing|      false|
|       James Johnson|            shot|           gun|Not Documented|   true|   

In [19]:
# Shootings where the victim's flee status wasn't documented
spark.sql('SELECT name, manner_of_death, armed, race, s_o_m_i, threat_level, flee, body_camera FROM policeShootingsNorm where flee is NULL').show()

+--------------------+----------------+--------------+----+-------+------------+----+-----------+
|                name| manner_of_death|         armed|race|s_o_m_i|threat_level|flee|body_camera|
+--------------------+----------------+--------------+----+-------+------------+----+-----------+
|      Ernesto Gamino|            shot|  undetermined|   H|  false|undetermined|null|      false|
|   Randy Allen Smith|            shot|           gun|   B|  false|      attack|null|      false|
|     Zachary Grigsby|            shot|           gun|   W|  false|      attack|null|      false|
|         Roy Carreon|            shot|         knife|   H|  false|      attack|null|      false|
|   Efrain Villanueva|            shot|unknown weapon|null|  false|      attack|null|      false|
|        Bettie Jones|            shot|       unarmed|   B|  false|       other|null|      false|
|  John Randell Veach|            shot|  undetermined|null|  false|undetermined|null|      false|
|John Alan Chamber..

### San Diego City, California

In [20]:
# Shootings in San Diego (queried from the base view, since policeShootingsNorm has no city column)
spark.sql("""
    SELECT name, manner_of_death, armed, race, s_o_m_i, threat_level, flee, body_camera 
    FROM policeShootings 
    WHERE city = 'San Diego'
""").show()

+--------------------+----------------+----------+----+-------+------------+-----------+-----------+
|                name| manner_of_death|     armed|race|s_o_m_i|threat_level|       flee|body_camera|
+--------------------+----------------+----------+----+-------+------------+-----------+-----------+
|Fridoon Zalbeg Nehad|            shot|   unarmed|   O|   true|       other|Not fleeing|       true|
|        Dennis  Fiel|            shot|       gun|   W|   true|      attack|       Foot|       true|
|          Ton Nguyen|            shot|     knife|   A|  false|       other|Not fleeing|      false|
|        Robert Hober|            shot|box cutter|   W|  false|       other|Not fleeing|       true|
|      Lamontez Jones|            shot|toy weapon|   B|  false|      attack|Not fleeing|      false|
|       Joshua Sisson|            shot|     knife|   W|  false|       other|Not fleeing|       true|
|Thongsoune Vilaysane|            shot|      null|   A|  false|       other|        Car|   