# Data Exploration for Police Shooting Data

**Synopsis:** This notebook explores the Police Shootings dataset used by the Police Shooting Dashboard.

**Note:** The dataset is updated frequently, so a weekly download can be scheduled to keep the local copy current.
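
One way to act on this note is to re-download the file only when the local copy is more than a week old. The sketch below is a minimal illustration and not part of the original notebook: `DATA_URL`, `needs_refresh`, and `refresh` are hypothetical names, and the raw-file URL is an assumption based on the layout of the referenced repository.

```python
import os
import time
import urllib.request

# Assumed raw-file location inside the referenced repository (verify before use).
DATA_URL = ("https://raw.githubusercontent.com/washingtonpost/"
            "data-police-shootings/master/v2/fatal-police-shootings-data.csv")
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # one week


def needs_refresh(path, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return True when the file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_age_seconds


def refresh(path, url=DATA_URL):
    """Download the dataset to path if the local copy is stale."""
    if needs_refresh(path):
        urllib.request.urlretrieve(url, path)
        return True
    return False
```

Calling `refresh(...)` from a weekly cron job or another scheduler would keep the file current; the scheduling itself is left out of this sketch.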

**Reference**:
* [Police Shootings Data](https://github.com/washingtonpost/data-police-shootings/tree/master/v2)

*****

In [1]:
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.join(os.path.dirname(os.getcwd()), 'config.ini'))

['/home/lpascual/Projects/PoliceShootingsDashboard/config.ini']
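
The cells further down read file locations from `config['pathways']['police_shootings']` and `config['pathways']['police_shootings_agencies']`, so `config.ini` presumably contains a `[pathways]` section with those two keys. A minimal sketch of that layout follows; the section and key names come from this notebook, while the file paths are placeholders.

```python
import configparser

# Placeholder paths; only the section and key names are taken from the notebook.
SAMPLE_CONFIG = """
[pathways]
police_shootings = /path/to/fatal-police-shootings-data.csv
police_shootings_agencies = /path/to/fatal-police-shootings-agencies.csv
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Fail early with a clear message if an expected key is missing.
for key in ("police_shootings", "police_shootings_agencies"):
    if not sample.has_option("pathways", key):
        raise KeyError(f"config.ini is missing [pathways] {key}")

shootings_path = sample["pathways"]["police_shootings"]
```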

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType, BooleanType, FloatType

spark = SparkSession.builder.master('local[*]').appName('DataExploration-PoliceShootings').getOrCreate()

In [2]:
# spark.stop()  # Only run this to shut down an existing session; the cells below need an active session

## Police Shooting Data

In [3]:
%%capture
# Schema of the Police Shootings Dataset (Sept 2021)
"""
psSchema = StructType([\
                       StructField('id', IntegerType(), False),
                       StructField('name', StringType(), True),
                       StructField('date', DateType(), True),
                       StructField('manner_of_death', StringType(), True),
                       StructField('armed', StringType(), True),
                       StructField('age', IntegerType(), True),
                       StructField('gender', StringType(), True),
                       StructField('race', StringType(), True),
                       StructField('city', StringType(), True),
                       StructField('state', StringType(), True),
                       StructField('s_o_m_i', BooleanType(), True),
                       StructField('threat_level', StringType(), True),
                       StructField('flee', StringType(), True),
                       StructField('body_camera', BooleanType(), True),
                       StructField('longitude', FloatType(), True),
                       StructField('latitude', FloatType(), True),
                       StructField('is_geocoding_exact', BooleanType(), True)
                        ])
"""                        

**Note:** Several fields were renamed between the two versions. The notable changes are:
* **Agency ID:** added (the dataset now links to a second dataset containing agency information)
* **County:** added (supplements city and state)
* **Race Source:** added (how the race information was obtained)
* **Manner of Death:** removed (it added little value; the only values were shot, or shot and tased)

For more information refer to:<br/>
https://github.com/washingtonpost/data-police-shootings/tree/master/v2
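
The differences between the two schema versions can also be listed mechanically by comparing field names. The sketch below uses plain Python sets on the field names from the 2021 and 2023 schemas in this notebook. A renamed field (presumably `s_o_m_i` to `mental_illness`, for example) shows up as one removal plus one addition. The variable names are illustrative.

```python
# Field names copied from the Sept 2021 schema in this notebook.
cols_2021 = [
    "id", "name", "date", "manner_of_death", "armed", "age", "gender", "race",
    "city", "state", "s_o_m_i", "threat_level", "flee", "body_camera",
    "longitude", "latitude", "is_geocoding_exact",
]

# Field names copied from the Oct 2023 schema in this notebook.
cols_2023 = [
    "id", "date", "threat_level", "flee", "armed", "city", "county", "state",
    "longitude", "latitude", "location_precision", "name", "age", "gender",
    "race", "race_source", "mental_illness", "body_camera", "agency_ids",
]

added = sorted(set(cols_2023) - set(cols_2021))
removed = sorted(set(cols_2021) - set(cols_2023))

print("added:  ", added)
print("removed:", removed)
```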

In [4]:
# Schema of the Police Shootings Dataset (Oct 2023)
psSchema = StructType([\
                       StructField('id', IntegerType(), False),
                       StructField('date', DateType(), True),
                       StructField('threat_level', StringType(), True),
                       StructField('flee', StringType(), True),
                       StructField('armed', StringType(), True),
                       StructField('city', StringType(), True),
                       StructField('county', StringType(), True),
                       StructField('state', StringType(), True),
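                       # The schema is applied by position, and the values read from the
                       # file put latitude before longitude (see the preview output).
                       StructField('latitude', FloatType(), True),
                       StructField('longitude', FloatType(), True),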
                       StructField('location_precision', StringType(), True),
                       StructField('name', StringType(), True),
                       StructField('age', IntegerType(), True),
                       StructField('gender', StringType(), True),
                       StructField('race', StringType(), True),
                       StructField('race_source', StringType(), True),
                       StructField('mental_illness', BooleanType(), True),
                       StructField('body_camera', BooleanType(), True),
                       StructField('agency_ids', IntegerType(), True)
                        ])                     

In [5]:
# Creating Dataframe and Temp View
psDF = spark.read.option('header', 'True').schema(psSchema).csv(config['pathways']['police_shootings'])
psDF.createOrReplaceTempView('policeShootings')

### Preview of the Data

In [6]:
# First half of the columns
psDF.select('id', 'name', 'date', 'armed', 'age', 'gender', 'race', 'city', 'county', 'state').show(5)

+---+------------------+----------+-------+---+------+----+-------------+-------------+-----+
| id|              name|      date|  armed|age|gender|race|         city|       county|state|
+---+------------------+----------+-------+---+------+----+-------------+-------------+-----+
|  3|        Tim Elliot|2015-01-02|    gun| 53|  male|   A|      Shelton|        Mason|   WA|
|  4|  Lewis Lee Lembke|2015-01-02|    gun| 47|  male|   W|        Aloha|   Washington|   OR|
|  5|John Paul Quintero|2015-01-03|unarmed| 23|  male|   H|      Wichita|     Sedgwick|   KS|
|  8|   Matthew Hoffman|2015-01-04|replica| 32|  male|   W|San Francisco|San Francisco|   CA|
|  9| Michael Rodriguez|2015-01-04|  other| 39|  male|   H|        Evans|         Weld|   CO|
+---+------------------+----------+-------+---+------+----+-------------+-------------+-----+
only showing top 5 rows



In [7]:
# Second half of the columns
psDF.select('race_source','mental_illness', 'threat_level', 'flee', 'body_camera', 
            'longitude', 'latitude', 'location_precision', 'agency_ids').show(5)

+-------------+--------------+------------+----+-----------+-----------+---------+------------------+----------+
|  race_source|mental_illness|threat_level|flee|body_camera|  longitude| latitude|location_precision|agency_ids|
+-------------+--------------+------------+----+-----------+-----------+---------+------------------+----------+
|not_available|          true|       point| not|      false| -123.12159|47.246826|     not_available|        73|
|not_available|         false|       point| not|      false| -122.89169|45.487423|     not_available|        70|
|not_available|         false|        move| not|      false| -97.280556|37.694767|     not_available|       238|
|not_available|          true|       point| not|      false|-122.422005| 37.76291|     not_available|       196|
|not_available|         false|       point| not|      false| -104.69226|40.383938|     not_available|       473|
+-------------+--------------+------------+----+-----------+-----------+---------+------------------+----------+
only showing top 5 rows

### Relevant Columns for Joining to Other Datasets

In [8]:
# Select relevant columns
psDF.select('id', 'name', 'date', 'city', 'county', 'state').show(5)

+---+------------------+----------+-------------+-------------+-----+
| id|              name|      date|         city|       county|state|
+---+------------------+----------+-------------+-------------+-----+
|  3|        Tim Elliot|2015-01-02|      Shelton|        Mason|   WA|
|  4|  Lewis Lee Lembke|2015-01-02|        Aloha|   Washington|   OR|
|  5|John Paul Quintero|2015-01-03|      Wichita|     Sedgwick|   KS|
|  8|   Matthew Hoffman|2015-01-04|San Francisco|San Francisco|   CA|
|  9| Michael Rodriguez|2015-01-04|        Evans|         Weld|   CO|
+---+------------------+----------+-------------+-------------+-----+
only showing top 5 rows



### Relevant Columns to the Shooting

In [9]:
# Select relevant columns
psDF.select('name', 'date', 'armed', 'age', 'gender', 'race', 'mental_illness', 'threat_level', 'flee', 'body_camera').show(20)

+--------------------+----------+------------+---+------+----+--------------+------------+----+-----------+
|                name|      date|       armed|age|gender|race|mental_illness|threat_level|flee|body_camera|
+--------------------+----------+------------+---+------+----+--------------+------------+----+-----------+
|          Tim Elliot|2015-01-02|         gun| 53|  male|   A|          true|       point| not|      false|
|    Lewis Lee Lembke|2015-01-02|         gun| 47|  male|   W|         false|       point| not|      false|
|  John Paul Quintero|2015-01-03|     unarmed| 23|  male|   H|         false|        move| not|      false|
|     Matthew Hoffman|2015-01-04|     replica| 32|  male|   W|          true|       point| not|      false|
|   Michael Rodriguez|2015-01-04|       other| 39|  male|   H|         false|       point| not|      false|
|   Kenneth Joe Brown|2015-01-04|         gun| 18|  male|   W|         false|      attack| not|      false|
| Kenneth Arnold Buck|2015-0

### Context of the Shooting

In [10]:
# Looking at weapon, race, and situation details
spark.sql("SELECT armed, race, mental_illness, threat_level, flee, body_camera FROM policeShootings").show(20)

+------------+----+--------------+------------+----+-----------+
|       armed|race|mental_illness|threat_level|flee|body_camera|
+------------+----+--------------+------------+----+-----------+
|         gun|   A|          true|       point| not|      false|
|         gun|   W|         false|       point| not|      false|
|     unarmed|   H|         false|        move| not|      false|
|     replica|   W|          true|       point| not|      false|
|       other|   H|         false|       point| not|      false|
|         gun|   W|         false|      attack| not|      false|
|         gun|   H|         false|       shoot| car|      false|
|         gun|   W|         false|       point| not|      false|
|     unarmed|   W|         false|    accident| not|       true|
|     replica|   B|         false|       point| not|      false|
|       knife|   W|         false|      attack| not|      false|
|         gun|   B|         false|       point| not|      false|
|       knife|   B|      

### Unique Values of Mental Illness & Count

In [11]:
# Number of deaths with potential mental illness
spark.sql("SELECT mental_illness, count(*) as count FROM policeShootings GROUP BY mental_illness ORDER BY count DESC").show()

+--------------+-----+
|mental_illness|count|
+--------------+-----+
|         false| 6988|
|          true| 1787|
+--------------+-----+
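
Raw counts are easier to compare as shares of the total. The sketch below converts the counts shown above to percentages in plain Python; the same arithmetic could be done inside the SQL query. `to_shares` is an illustrative helper name, not part of the notebook.

```python
def to_shares(counts):
    """Convert a {label: count} mapping to {label: percent of total}."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the query output above.
mental_illness_counts = {"false": 6988, "true": 1787}

shares = to_shares(mental_illness_counts)
print(shares)  # {'false': 79.6, 'true': 20.4}
```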



### Unique Values of Armed & Count

In [12]:
# Number of deaths, grouped by weapon the individual was carrying
spark.sql("SELECT armed, count(*) as count FROM policeShootings GROUP BY armed ORDER BY count DESC").show()

+--------------------+-----+
|               armed|count|
+--------------------+-----+
|                 gun| 5086|
|               knife| 1472|
|             unarmed|  516|
|        undetermined|  348|
|             vehicle|  309|
|             replica|  288|
|        blunt_object|  216|
|                null|  210|
|             unknown|  137|
|               other|   88|
|         gun;vehicle|   38|
|           gun;knife|   35|
|         vehicle;gun|   15|
|           other;gun|    4|
|       knife;vehicle|    3|
|  blunt_object;knife|    2|
|  knife;blunt_object|    2|
|blunt_object;blun...|    2|
|       replica;knife|    1|
|other;blunt_objec...|    1|
+--------------------+-----+
only showing top 20 rows
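
The output above lists the same weapon combination under different orderings (`gun;vehicle` and `vehicle;gun`), so the grouped counts split what is arguably one category. A minimal sketch of one way to merge them, by sorting the semicolon-separated parts into a canonical order, is shown below in plain Python with a few counts copied from the table. `normalize_armed` is an illustrative name.

```python
from collections import Counter


def normalize_armed(value):
    """Sort the semicolon-separated parts so equivalent combinations match."""
    if value is None:
        return None
    parts = [p.strip() for p in value.split(";") if p.strip()]
    return ";".join(sorted(parts))


# A few rows copied from the query output above.
raw_counts = {
    "gun": 5086,
    "gun;vehicle": 38,
    "vehicle;gun": 15,
    "blunt_object;knife": 2,
    "knife;blunt_object": 2,
}

merged = Counter()
for value, n in raw_counts.items():
    merged[normalize_armed(value)] += n

print(dict(merged))
```

In Spark SQL, an expression along the lines of `array_join(array_sort(split(armed, ';')), ';')` would likely achieve the same normalization before grouping.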



### Unique Values of Flee & Count

In [13]:
# Number of deaths by method of fleeing
spark.sql("SELECT flee, count(*) as count FROM policeShootings GROUP BY flee ORDER BY count DESC").show()

+-----+-----+
| flee|count|
+-----+-----+
|  not| 4705|
|  car| 1404|
| null| 1192|
| foot| 1137|
|other|  337|
+-----+-----+



### Unique Values of Race & Count

In [14]:
# Number of deaths by race
policeShootingsNorm = spark.sql("""
SELECT race, count(*) as count FROM 
            (SELECT 
                CASE
                    WHEN race = 'A' THEN 'Asian'
                    WHEN race = 'B' THEN 'Black'
                    WHEN race = 'N' THEN 'Native'
                    WHEN race = 'H' THEN 'Hispanic'
                    WHEN race = 'W' THEN 'White'
                    WHEN race = 'O' THEN 'Other'
                    WHEN race = 'B;H' THEN 'Black and Hispanic'
                    ELSE 'Not Documented'
                END as race
                FROM    
                  policeShootings)
                GROUP BY race ORDER BY count DESC

""")
policeShootingsNorm.show()

+------------------+-----+
|              race|count|
+------------------+-----+
|             White| 3772|
|             Black| 1994|
|    Not Documented| 1409|
|          Hispanic| 1315|
|             Asian|  146|
|            Native|  117|
|             Other|   21|
|Black and Hispanic|    1|
+------------------+-----+



### Unique Values of Race Source & Count

In [15]:
# Race Source; how the race of the individual was obtained
spark.sql("SELECT race_source, count(*) as count FROM policeShootings GROUP BY race_source ORDER BY count DESC").show()

+-------------+-----+
|  race_source|count|
+-------------+-----+
|not_available| 6204|
|         null| 1389|
|        photo|  545|
|public_record|  520|
|         clip|   97|
| undetermined|   16|
|        other|    4|
+-------------+-----+



In [16]:
policeShootingsNorm = spark.sql("""
        SELECT 
            name,
            armed,
            mental_illness,
            threat_level,
            flee,
            city,
            body_camera,
            CASE
                WHEN race = 'A' THEN 'Asian'
                WHEN race = 'B' THEN 'Black'
                WHEN race = 'N' THEN 'Native'
                WHEN race = 'H' THEN 'Hispanic'
                WHEN race = 'W' THEN 'White'
                WHEN race = 'O' THEN 'Other'
                WHEN race = 'B;H' THEN 'Black and Hispanic'
                ELSE 'Not Documented'
            END as race
        FROM    
          policeShootings

""")
policeShootingsNorm.createOrReplaceTempView('policeShootingsNorm')
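
The same `CASE` expression appears in two cells. One way to keep the mapping in a single place is a dictionary keyed by race code, sketched below in plain Python. It also handles combined codes such as `B;H` by mapping each part. `RACE_LABELS` and `race_label` are illustrative names; the labels are the ones used in the queries above.

```python
RACE_LABELS = {
    "A": "Asian",
    "B": "Black",
    "N": "Native",
    "H": "Hispanic",
    "W": "White",
    "O": "Other",
}


def race_label(code):
    """Map a race code (or a ';'-separated combination) to a readable label."""
    if not code:
        return "Not Documented"
    labels = [RACE_LABELS.get(part) for part in code.split(";")]
    # Any unknown part falls back to the same default as the SQL ELSE branch.
    if any(label is None for label in labels):
        return "Not Documented"
    return " and ".join(labels)


print(race_label("B;H"))  # Black and Hispanic
print(race_label(None))   # Not Documented
```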

### Exploring Records Where Race Was Not Documented

In [17]:
# Shootings where race of victim isn't documented
spark.sql("""
    SELECT name, armed, race, mental_illness, threat_level, flee, body_camera 
    FROM policeShootingsNorm 
    WHERE race = 'Not Documented'
""").show()

+--------------------+-------+--------------+--------------+------------+-----+-----------+
|                name|  armed|          race|mental_illness|threat_level| flee|body_camera|
+--------------------+-------+--------------+--------------+------------+-----+-----------+
|    William Campbell|    gun|Not Documented|         false|       point|  not|      false|
|  John Marcell Allen|    gun|Not Documented|         false|       shoot|  not|      false|
|          Mark Smith|   null|Not Documented|         false|      attack|other|      false|
|          Joseph Roy|  knife|Not Documented|          true|      threat|  not|      false|
|James Anthony Morris|    gun|Not Documented|          true|       shoot|  not|      false|
|       James Johnson|    gun|Not Documented|          true|       point|  not|      false|
|    Raymond Phillips|    gun|Not Documented|          true|       point|  not|      false|
|       Brian Johnson|  other|Not Documented|          true|      attack|  not|      false|

In [18]:
# Shootings where the victim's flee status wasn't documented
spark.sql('SELECT name, armed, race, mental_illness, threat_level, flee, body_camera FROM policeShootingsNorm WHERE flee IS NULL').show()

+--------------------+------------+--------------+--------------+------------+----+-----------+
|                name|       armed|          race|mental_illness|threat_level|flee|body_camera|
+--------------------+------------+--------------+--------------+------------+----+-----------+
|      Ernesto Gamino|undetermined|      Hispanic|         false|undetermined|null|      false|
|   Randy Allen Smith|         gun|         Black|         false|       point|null|      false|
|     Zachary Grigsby|         gun|         White|         false|       shoot|null|      false|
|         Roy Carreon|       knife|      Hispanic|         false|      attack|null|      false|
|   Efrain Villanueva|     unknown|Not Documented|         false|      attack|null|      false|
|        Bettie Jones|     unarmed|         Black|         false|    accident|null|      false|
|  John Randell Veach|undetermined|Not Documented|         false|undetermined|null|      false|
|John Alan Chamber...|undetermined|Not D

### San Diego City, California

In [19]:
spark.sql("""
    SELECT name, armed, race, mental_illness, threat_level, flee, body_camera 
    FROM policeShootingsNorm 
    WHERE city = 'San Diego'
""").show()

+--------------------+------------+--------------+--------------+------------+-----+-----------+
|                name|       armed|          race|mental_illness|threat_level| flee|body_camera|
+--------------------+------------+--------------+--------------+------------+-----+-----------+
|Fridoon Zalbeg Nehad|     unarmed|         Other|          true|        move|  not|       true|
|         Dennis Fiel|         gun|         White|          true|       shoot| foot|       true|
|          Ton Nguyen|       knife|         Asian|         false|      threat|  not|      false|
|        Robert Hober|       knife|         White|         false|      threat|  not|       true|
|      Lamontez Jones|     replica|         Black|         false|       point|  not|      false|
|       Joshua Sisson|       knife|         White|         false|      threat|  not|       true|
|Thongsoune Vilaysane|        null|         Asian|         false|        move|  car|       true|
|Juan Carlos Ferna...|         gun|     

## Police Shooting (Agencies) Data 

In [20]:
# Schema of the Police Shootings Agency Dataset (Oct 2023)
psaSchema = StructType([\
                       StructField('id', IntegerType(), False),
                       StructField('name', StringType(), True),
                       StructField('type', StringType(), True),
                       StructField('state', StringType(), True),
                       StructField('oricodes', StringType(), True),
                       StructField('total_shootings', IntegerType(), True)
                        ])                     

In [21]:
# Creating Dataframe and Temp View
psaDF = spark.read.option('header', 'True').schema(psaSchema).csv(config['pathways']['police_shootings_agencies'])
psaDF.createOrReplaceTempView('policeShootingsAgencies')

### Preview of the Data

In [22]:
# All columns
psaDF.select('id', 'name', 'type', 'state', 'oricodes', 'total_shootings').show(5)

+----+--------------------+------------+-----+--------+---------------+
|  id|                name|        type|state|oricodes|total_shootings|
+----+--------------------+------------+-----+--------+---------------+
|3145|Abbeville County ...|     sheriff|   SC| SC00100|              1|
|2576|Aberdeen Police D...|local_police|   WA| WA01401|              1|
|2114|Abilene Police De...|local_police|   TX| TX22101|              3|
|2088|Abington Township...|local_police|   PA| PA04601|              1|
|3187|Acadia Parish She...|     sheriff|   LA| LA00100|              1|
+----+--------------------+------------+-----+--------+---------------+
only showing top 5 rows



### Agencies with the Most Recorded Shootings

In [23]:
spark.sql("""
    SELECT id, name, type, state, total_shootings 
    FROM policeShootingsAgencies
    ORDER BY total_shootings DESC
""").show(10)

+---+--------------------+------------+-----+---------------+
| id|                name|        type|state|total_shootings|
+---+--------------------+------------+-----+---------------+
| 38|Los Angeles Polic...|local_police|   CA|            129|
| 80|Phoenix Police De...|local_police|   AZ|            109|
| 20|Los Angeles Count...|     sheriff|   CA|            103|
|102|Houston Police De...|local_police|   TX|             76|
|298|New York Police D...|local_police|   NY|             65|
|375|Las Vegas Metropo...|local_police|   NV|             64|
| 44|San Antonio Polic...|local_police|   TX|             60|
| 19|Pennsylvania Stat...|state_police|   PA|             57|
|266|California Highwa...|state_police|   CA|             55|
|267|Riverside County ...|     sheriff|   CA|             53|
+---+--------------------+------------+-----+---------------+
only showing top 10 rows



### State with the Most Total Shootings By Department Type

In [24]:
# Local Police
spark.sql("""
    SELECT state, type, sum(total_shootings) as total_shootings
    FROM policeShootingsAgencies
    WHERE type = 'local_police'
    GROUP BY state, type
    ORDER BY total_shootings DESC
""").show(10)

+-----+------------+---------------+
|state|        type|total_shootings|
+-----+------------+---------------+
|   CA|local_police|            805|
|   TX|local_police|            588|
|   AZ|local_police|            314|
|   FL|local_police|            239|
|   CO|local_police|            238|
|   GA|local_police|            179|
|   OH|local_police|            176|
|   MO|local_police|            174|
|   OK|local_police|            172|
|   WA|local_police|            162|
+-----+------------+---------------+
only showing top 10 rows



In [25]:
# Sheriff
spark.sql("""
    SELECT state, type, sum(total_shootings) as total_shootings
    FROM policeShootingsAgencies
    WHERE type = 'sheriff'
    GROUP BY state, type
    ORDER BY total_shootings DESC
""").show(10)

+-----+-------+---------------+
|state|   type|total_shootings|
+-----+-------+---------------+
|   CA|sheriff|            399|
|   FL|sheriff|            324|
|   TX|sheriff|            177|
|   GA|sheriff|            155|
|   NC|sheriff|            114|
|   TN|sheriff|            109|
|   SC|sheriff|            104|
|   LA|sheriff|            100|
|   WA|sheriff|             84|
|   CO|sheriff|             82|
+-----+-------+---------------+
only showing top 10 rows



In [26]:
# State Police
spark.sql("""
    SELECT state, type, sum(total_shootings) as total_shootings
    FROM policeShootingsAgencies
    WHERE type = 'state_police'
    GROUP BY state, type
    ORDER BY total_shootings DESC
""").show(10)

+-----+------------+---------------+
|state|        type|total_shootings|
+-----+------------+---------------+
|   PA|state_police|             58|
|   TX|state_police|             57|
|   CA|state_police|             55|
|   KY|state_police|             48|
|   AK|state_police|             30|
|   OR|state_police|             26|
|   GA|state_police|             25|
|   NM|state_police|             25|
|   MI|state_police|             25|
|   WV|state_police|             23|
+-----+------------+---------------+
only showing top 10 rows



In [27]:
# Federal
spark.sql("""
    SELECT state, type, sum(total_shootings) as total_shootings
    FROM policeShootingsAgencies
    WHERE type = 'federal'
    GROUP BY state, type
    ORDER BY total_shootings DESC
""").show(10)

+-----+-------+---------------+
|state|   type|total_shootings|
+-----+-------+---------------+
|   TX|federal|             31|
|   AZ|federal|             17|
|   CA|federal|             15|
|   NM|federal|             12|
|   US|federal|             10|
|   TN|federal|              9|
|   MO|federal|              8|
|   OH|federal|              7|
|   MT|federal|              7|
|   PA|federal|              7|
+-----+-------+---------------+
only showing top 10 rows



In [28]:
# San Diego (From 2015 to 2023)
spark.sql("""
    SELECT name, type, total_shootings
    FROM policeShootingsAgencies
    WHERE name like '%San Diego%'
    ORDER BY total_shootings DESC
""").show(10)

+--------------------+------------+---------------+
|                name|        type|total_shootings|
+--------------------+------------+---------------+
|San Diego Police ...|local_police|             38|
|San Diego County ...|     sheriff|             17|
+--------------------+------------+---------------+



## Joining Police Shootings & Agency Data

In [29]:
spark.sql("""
    SELECT ps.name as victim_name, psa.name as agency_name, psa.state, 
        psa.type as department_type, psa.total_shootings as department_total_shootings
    FROM policeShootings as ps
    JOIN policeShootingsAgencies as psa
    ON ps.agency_ids = psa.id
""").show(20)

+--------------------+--------------------+-----+---------------+--------------------------+
|         victim_name|         agency_name|state|department_type|department_total_shootings|
+--------------------+--------------------+-----+---------------+--------------------------+
| Evin Kimberly Payne|Abbeville County ...|   SC|        sheriff|                         1|
|Kristopher Fitzpa...|Aberdeen Police D...|   WA|   local_police|                         1|
|        Kevin Greene|Abilene Police De...|   TX|   local_police|                         3|
|    Lebarron Ballard|Abilene Police De...|   TX|   local_police|                         3|
|Michael Leroy McG...|Abilene Police De...|   TX|   local_police|                         3|
|    Angel Luis Ortiz|Abington Township...|   PA|   local_police|                         1|
|                null|Acadia Parish She...|   LA|        sheriff|                         1|
|   Gabriel Scott Rau|Acworth Police De...|   GA|   local_police|     
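
The join above matches `agency_ids` against a single agency `id`. If a record can list several agencies separated by semicolons (an assumption about the dataset, suggested by the plural column name and by the `B;H` style value seen in `race`), such rows would not match an integer key. The sketch below shows the idea of splitting the ids before joining, in plain Python with made-up sample rows; `parse_agency_ids` and all sample values are hypothetical.

```python
def parse_agency_ids(raw):
    """Split a ';'-separated id string into a list of integers."""
    if not raw:
        return []
    return [int(part) for part in raw.split(";") if part.strip()]


# Made-up sample data for illustration only.
agencies = {73: "Agency A", 70: "Agency B"}
shootings = [
    ("person_1", "73"),
    ("person_2", "73;70"),  # two agencies involved
    ("person_3", None),     # no agency recorded
]

# One output row per (person, agency) pair, like an inner join after exploding ids.
joined = [
    (person, agencies[agency_id])
    for person, raw_ids in shootings
    for agency_id in parse_agency_ids(raw_ids)
    if agency_id in agencies
]

print(joined)
```

As with any inner join, records without a matching agency (such as `person_3` here) drop out of the result.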

In [None]:
spark.stop()