## Criminal Justice Analytics
### Applying the Data Science Methodology to assess the practicality of analyzing, visualizing, and reporting on data as a method to describe crimes, diagnose scenarios that support criminal behavior, predict criminal activity, and prescribe scenarios that lower the probability of criminal activity.
---
#### Princeton Brooke
###### Data Scientist and Cyber Security Software Engineer
###### Website & Software Engineering Incorporated, Cleveland, Ohio

###### Sunday, November 18, 2018
###### This report is published for the IBM Data Science Professional Certification Coursera Program


## Executive Summary
#### By applying the data science methodology to the field of criminal justice, we can use descriptive analytics to answer what crimes have occurred, diagnostic analytics to answer why a crime occurred, predictive analytics to gain insight into what crimes might occur, and prescriptive analytics to identify scenarios that lower the probability of a crime occurring again.
#### The data science methodology can be used to aid crime fighters. Businesses have used analytics for decades to gain a competitive advantage. Manufacturers use analytics to predict when robotic machines will fail, decrease assembly-line downtime, and anticipate maintenance requests. By collecting the right data, applying statistical algorithms to visualize and build models, and delivering actionable reporting dashboards, crime fighters can be more effective at anticipating criminal behavior, stopping crimes before they happen, lessening the impact of crimes, improving response times, and recovering after crimes occur.

---
## Literature Review
#### A notable proponent of applying data science within the criminal justice realm is Anne Milgram. In a TED Talk published on January 28, 2014, she described her passion for “moneyballing” criminal justice. She explains that data can be used to reduce crime and reallocate resources towards projects that improve public safety.
#### At the time of her presentation, statistics showed 12 million arrests per year, of which 70-80% were for low-level crimes and fewer than 5% were for violent crimes.
#### $75 billion is spent on state and local corrections while 2.3 million people are held in jail or prison. Two-thirds are awaiting trial and have not been convicted. The recidivism rate is 67%, which means roughly 7 in 10 people are arrested multiple times.
#### The highest-risk offenders are being released because judges are not using data to drive decisions. Anne found that 5/10 criminal justice jurisdictions utilize analytics tools. Analytics projects are costly to implement. A universal, easy-to-use risk assessment tool was built. The tool can predict whether someone will commit a new crime upon release, whether an act of violence will be committed, and whether a person will reappear in court.
#### The data collected within the Public Safety Assessment Dashboard includes the following:
#### The defendant's biographical details, the current offense type, whether the defendant was under the age of 21 or had pending charges at the time of the offense, prior convictions (misdemeanor, felony, or violent), prior sentences to incarceration, and any prior failures to appear. A Pretrial Assessment Dashboard processes the data and displays a New Criminal Activity (NCA) Score and a Failure to Appear (FTA) Score.


## Data & Methodology
#### This project will focus on data acquired from the city of Cleveland, Ohio. I will also gather geo-location data on the Pokémon Go Pokestop hubs within the Greater Cleveland region and use the FourSquare API to query social media check-in data.
---
#### Before we can begin to answer core questions on this topic, we must conduct an analytical review. We must ask questions about crimes and criminal behaviors, and inquire about the data that is being collected. Is the data accessible and consistent? Does the data provide enough features to produce visualizations that clearly communicate ideas? Can the database be altered to allow the collection of future features so that we can gain additional insights?
#### This report seeks to examine criminal data collected from the city of Cleveland, Ohio and to determine whether it correlates with geo-location check-ins from social media. We will compare the same crimes data with check-ins that are tagged with Pokémon Pokestops to see if the outcomes shift or remain consistent. We will use the FourSquare API to acquire location-based data; the Cleveland crimes data is published by the city of Cleveland.
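#### One way the comparison described above might be quantified is with a Pearson correlation between crime counts and check-in counts per area. The sketch below is only an illustration under stated assumptions: the function name `pearson_r` and the per-area numbers are placeholders invented for the example, not Cleveland data and not results of this report.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (hypothetical areas, not real data)
example_checkins = [12, 30, 7, 45, 22]   # check-ins per area
example_crimes = [3, 9, 2, 11, 6]        # reported crimes per area

r = pearson_r(example_checkins, example_crimes)
print("Pearson r for the placeholder data:", round(r, 3))
```

#### A value near +1 or -1 would indicate a strong linear relationship and a value near 0 a weak one. Correlation alone would not show that check-ins cause or prevent crime.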


In [28]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import requests # library to handle requests

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)

import numpy as np # library to handle data in a vectorized manner
import re # import library for regular expression
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a json file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.17.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Folium installed
Libraries imported.


In [40]:
# Cleveland Crimes Data

from pandas import read_excel
my_sheet_name = 'Ohio Crime By County 2016' 
cleveland_crimes_df = read_excel('http://www.publicsafety.ohio.gov/links/ocjs-crime-by-county2016.xlsx', sheet_name = my_sheet_name)
print(cleveland_crimes_df.head()) # shows headers with top 5 rows


                            AGENCY NAME  POPULATION  VIOLENT CRIME  \
0                                   NaN         NaN            NaN   
1  OHIO DEPARTMENT OF NATURAL RESOURCES         NaN            2.0   
2             OHIO STATE HGHWAY PATROL          NaN          291.0   
3                                 TOTAL         NaN          293.0   
4                                   NaN         NaN            NaN   

   PROPERTY CRIME  MURDER  RAPE  ROBBERY  AGGRAVATED ASSAULT  BURGLARY  \
0             NaN     NaN   NaN      NaN                 NaN       NaN   
1            17.0     NaN   NaN      1.0                 1.0       2.0   
2           233.0     1.0  35.0      4.0               251.0      12.0   
3           250.0     1.0  35.0      5.0               252.0      14.0   
4             NaN     NaN   NaN      NaN                 NaN       NaN   

   LARCENY  MTR VEHICLE THEFT  ARSON  Unnamed: 12 Unnamed: 13  Unnamed: 14  \
0      NaN                NaN    NaN          NaN       

In [41]:
cleveland_crimes_df.describe(include=['object'])

                 AGENCY NAME Unnamed: 13 Unnamed: 16 Unnamed: 18
count                    768           2           2           3
unique                   710           1           2           3
top     SPRINGFIELD TOWNSHIP         12*     OH07609     MINERVA
freq                       3           2           1           1
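
#### The preview and summary above show blank spacer rows, a TOTAL row, and sparsely filled `Unnamed` columns. A minimal cleaning sketch follows, assuming those rows and columns are not needed for analysis. The helper name `clean_crime_sheet` is hypothetical, and `raw` is a small stand-in built from the first rows printed above rather than the full spreadsheet.

```python
import numpy as np
import pandas as pd

# Stand-in for the first rows of the sheet, copied from the preview above
raw = pd.DataFrame({
    "AGENCY NAME": [np.nan, "OHIO DEPARTMENT OF NATURAL RESOURCES",
                    "OHIO STATE HGHWAY PATROL ", "TOTAL", np.nan],
    "POPULATION": [np.nan] * 5,
    "VIOLENT CRIME": [np.nan, 2.0, 291.0, 293.0, np.nan],
    "PROPERTY CRIME": [np.nan, 17.0, 233.0, 250.0, np.nan],
    "Unnamed: 12": [np.nan] * 5,
})


def clean_crime_sheet(df):
    """Drop placeholder columns, blank spacer rows, and subtotal rows."""
    # assumption: columns pandas labelled 'Unnamed: N' hold only stray notes
    keep = [c for c in df.columns if not str(c).startswith("Unnamed")]
    out = df.loc[:, keep].dropna(how="all")       # remove fully empty rows
    names = out["AGENCY NAME"].str.strip()        # trim stray whitespace
    out = out.assign(**{"AGENCY NAME": names})
    out = out[names.str.upper() != "TOTAL"]       # remove subtotal rows
    return out.reset_index(drop=True)


cleaned = clean_crime_sheet(raw)
print(cleaned)
```

#### Dropping the TOTAL rows avoids double counting when the remaining agency rows are summed.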


In [42]:
cleveland_crimes_df['VIOLENT CRIME'].value_counts()

1.0        77
2.0        42
3.0        41
6.0        27
5.0        24
4.0        23
14.0       18
7.0        15
9.0        12
11.0       12
10.0       12
13.0       10
20.0       10
17.0       10
8.0         9
24.0        9
22.0        8
15.0        8
12.0        8
45.0        7
19.0        7
16.0        6
30.0        6
32.0        6
26.0        6
75.0        5
39.0        5
27.0        5
31.0        5
29.0        5
           ..
107.0       1
127.0       1
87.0        1
240.0       1
91.0        1
93.0        1
2720.0      1
200.0       1
32703.0     1
84.0        1
254.0       1
94.0        1
120.0       1
41.0        1
46.0        1
116.0       1
99.0        1
72.0        1
108.0       1
1072.0      1
1216.0      1
228.0       1
58.0        1
85.0        1
585.0       1
293.0       1
218.0       1
98.0        1
460.0       1
3463.0      1
Name: VIOLENT CRIME, Length: 134, dtype: int64

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
# If the criminal database includes addresses we can attempt to use the code within this cell

address = '230 W Huron Rd, Cleveland, OH 44113'

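geolocator = Nominatim(user_agent="criminal_justice_analytics") # Nominatim expects an application-specific user agent; this name is an arbitrary choice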
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)



41.4969221 -81.6934307


In [20]:
# Our data provided by the city of Cleveland only includes county-level names, so we provide the location manually
latitude = 41.43
longitude = -81.67

## 1. FourSquare API Call
> `https://api.foursquare.com/v2/venues/`**search**`?client_id=`**CLIENT_ID**`&client_secret=`**CLIENT_SECRET**`&ll=`**LATITUDE**`,`**LONGITUDE**`&v=`**VERSION**`&query=`**QUERY**`&radius=`**RADIUS**`&limit=`**LIMIT**
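
#### The template above can also be assembled from a dictionary of parameters, which keeps each value next to its name and handles URL encoding. This is a sketch using only the Python standard library. The function name `build_search_url` is hypothetical, and the credential, version, and limit values passed in the example are placeholders, since the real ones are not shown in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, latitude, longitude,
                     version, query, radius, limit):
    """Assemble the venue search URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(latitude, longitude),  # latitude,longitude pair
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value readable instead of %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


# Placeholder credentials, version, and limit (assumptions for illustration)
example_url = build_search_url("CLIENT_ID", "CLIENT_SECRET", 41.43, -81.67,
                               "VERSION", "pokestop", 500, 30)
print(example_url)
```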

In [23]:
search_query = 'pokestop'
radius = 500
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

In [24]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5bf25a03351e3d0bc3b1fd2b'},
 'response': {'venues': []}}

In [25]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.head()

In [26]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so the column assignment below does not modify a view

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the nested key when 'categories' is absent
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered

KeyError: "None of [['name', 'categories', 'id']] are in the [columns]"

In [27]:
# Visualize the data
dataframe_filtered.name

NameError: name 'dataframe_filtered' is not defined
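
#### Both errors above trace back to the empty `venues` list returned by the search: with no venues there are no `name`, `categories`, or `id` columns to select. A minimal sketch of a guard follows, written in plain Python so it runs without the notebook's hidden credentials. The helper name `summarize_venues` is hypothetical, and the `lat` and `lng` keys are an assumption about the response layout, since no venue was returned here to confirm it.

```python
def summarize_venues(results):
    """Return one small dict per venue; an empty list if none were returned."""
    venues = results.get("response", {}).get("venues", [])
    rows = []
    for venue in venues:
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        rows.append({
            "name": venue.get("name"),
            # first category name, or None when the venue has no category
            "category": categories[0]["name"] if categories else None,
            "lat": location.get("lat"),   # assumed key name
            "lng": location.get("lng"),   # assumed key name
            "id": venue.get("id"),
        })
    return rows


# The response actually returned by the search in this notebook
empty_results = {'meta': {'code': 200, 'requestId': '5bf25a03351e3d0bc3b1fd2b'},
                 'response': {'venues': []}}

rows = summarize_venues(empty_results)
if not rows:
    print("No venues returned for this query; skipping the venue table.")
```

#### Checking for an empty result before filtering columns lets the notebook report the missing data instead of stopping on a KeyError.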

---
## Results
#### The criminal justice system can use analytics to streamline processes and ultimately become more effective in the way crimes are analyzed. Judges can use data to make evidence-based decisions about the people seeking to be released from jail or prison.
#### Because of limits on the data I could acquire, the sources did not provide enough features to expand on the outcomes at the time of this report. The city of Cleveland has not provided the public with the new database required to run deep analysis and is currently revising the way it acquires and publishes its crimes database. Data aggregated by county was available; however, a complete assessment of the crimes, and a thorough understanding of how they might relate to social media check-ins or to geo-location community-based games such as Pokémon Go, would require additional features: the time and date of each crime and its location, including latitude and longitude.
#### A future assessment would require additional features based on Pokémon Go. Collecting the latitude and longitude of every Pokémon Pokestop and Pokémon Gym would expand the research in several ways. This new dataset could be aggregated and clustered to visualize crimes per neighborhood and city, as well as at the county level.
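#### If point-level data became available, one simple aggregation would be to count incidents within a fixed radius of each Pokestop. The sketch below uses the haversine great-circle distance. The function names are hypothetical, and the points are placeholders: the two coordinates already used in this report stand in for Pokestops, and the single incident location is invented for illustration, not real crime data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within_radius(stops, incidents, radius_km):
    """Return {stop name: number of incidents within radius_km of it}."""
    counts = {}
    for name, (s_lat, s_lon) in stops.items():
        counts[name] = sum(
            1 for (i_lat, i_lon) in incidents
            if haversine_km(s_lat, s_lon, i_lat, i_lon) <= radius_km
        )
    return counts


# Placeholder stops: the geocoded address and the manual point from this report
stops = {
    "geocoded_address": (41.4969221, -81.6934307),
    "manual_point": (41.43, -81.67),
}
incidents = [(41.4969221, -81.6934307)]  # hypothetical incident location

counts = count_within_radius(stops, incidents, radius_km=0.5)
print(counts)
```

#### The 0.5 km radius mirrors the 500 metre search radius used in the FourSquare query. A real analysis would need to choose the radius deliberately and account for differences in population and foot traffic between areas.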


---
## Acknowledgements and References
#### Case Western Reserve University
###### Neocando – Neighborhood Data Warehouse
###### Center on Urban Poverty and Community Development
###### http://neocando.case.edu/neocando/

#### FourSquare Developers API
###### https://developer.foursquare.com/

#### Getting Started with Data Science
###### Murtaza Haider, IBM Press
###### https://www.amazon.com/Getting-Started-Data-Science-Analytics/dp/0133991024

#### US City Open Data Census
###### http://us-city.census.okfn.org/dataset/crime-stats
