In [1]:
!python --version

Python 3.5.4 :: Anaconda, Inc.


In [30]:
import sys
print("Python {}.{}.{}".format(*sys.version_info[:3]))

Python 3.5.4


## Data-Driven Justice
---

### Guideline
We are going to collect and analyze crime data that will help us to:
- Reveal patterns relevant to criminal justice: when and where will the next crime take place in a specific region or community?
- Identify interesting uses of these data, e.g. identifying discriminatory practices or predicting future crime events.
- Classify crime: how does the socio-economic background of a person influence their probability of committing a crime? How does the socio-economic background of the criminal relate to the type of crime committed?
- Understand the perception of safety among citizens: what factors influence the perception of safety for European citizens?
- Investigate the relationship between the perception of crime and actual crime statistics within communities.
- Explain regional differences, e.g. by looking at demographics, the prevalence of political affiliations, or other interesting factors.

### Data Science Pipeline
---

Exemplary procedure outline:

- Loop 1 (Data Exploration)
  - Specify a set of research questions
  - Create a data pipeline strategy
  - Find and collect datasets
  - Have a look at descriptive statistics
  - Create new features as needed
  - With the data on hand, continue to ask questions:
    - Do we still have the same research questions?
    - Do we still have the same strategy?
    - Shall we find other sources?

- Loop 2 (Data Integration)
  - Integrate raw data
  - Write scripts for cleaning and aggregation
  - Feature selection
  - Variable interpretation
  - Find correlations between variables
  - With the summary on hand, continue to ask questions:
    - Do we need different variables?
    - Do we need to normalize our features?
    - Are we still confident we can answer the research question?

- Loop 3 (Data Analysis)
  - Identify the method that fits the target variable:
    - If it is a crime pattern: Anomaly Detection
    - If it is a crime cycle: Time Series
    - If it is a crime category: Classification
    - If it is the quantity of crimes: Regression
    - If it is a concept and not a target variable: unsupervised, density-based methods
    - Etc. (these options are illustrative, not exhaustive)
  - Design a concept to visualize the outcome
  - Choose and develop the algorithm (e.g. if Classification, SVM or Bayes?)
  - Plot the outcomes
  - With the summary on hand, ask:
    - Does the outcome make sense?
    - What do the important variables mean (in “human” language)?
    - How can we improve the outcome? (Add more target variables? Change them? Change the parameters? Change the method?)

- Loop 4 (Data Communication)
  - Document the procedures
  - Keep the scripts under version control for reproducible science
  - Develop a data story
  - With the data story (notebook, Tableau story, R Markdown, slides, etc.) on hand, ask: what’s next?
    - Improve the model?
    - Improve the data?
    - Bring more researchers into the problem?
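
The method choice in Loop 3 can be written down as a simple lookup. The sketch below only mirrors the outline above: the key names and the `suggest_method` helper are hypothetical, chosen for illustration, and the list is not meant to be exhaustive.

```python
# Hypothetical lookup mirroring the Loop 3 outline; the keys are illustrative labels.
METHOD_BY_TARGET = {
    "crime pattern": "Anomaly Detection",
    "crime cycle": "Time Series",
    "crime category": "Classification",
    "quantity of crimes": "Regression",
    "concept": "Unsupervised, density-based methods",
}


def suggest_method(target_kind):
    """Return the analysis family suggested for a kind of target, or None if unknown."""
    return METHOD_BY_TARGET.get(target_kind.strip().lower())


print(suggest_method("Crime category"))
```

Returning `None` for an unknown target keeps the "Etc." case open: anything not in the table still needs a deliberate choice.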


In [1]:
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
%matplotlib inline

### JSON Editor
http://jsoneditoronline.org/

### Data Portal
https://data.overheid.nl/data/dataset?q=crime&sort=score+desc%2C+modified+desc%2C+metadata_modified+desc

### Deaths; murder and manslaughter, crime scene in The Netherlands
This table contains the number of persons who died as a result of murder or manslaughter where the crime scene is located in the Netherlands. The victims can be residents or non-residents. The data can be split by location of the crime, method, age and sex. The date of death is the criterion; the act itself may have taken place in the previous year. The ICD-10 codes that belong to murder and manslaughter are X85-Y09.  
https://data.overheid.nl/data/dataset/deaths-murder-and-manslaughter-crime-scene-in-the-netherlands
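
To make the X85-Y09 range concrete, here is a minimal sketch of a range check. The helper name `is_assault_code` is hypothetical, and it assumes codes are written as one letter followed by at least two digits, optionally with a decimal suffix such as `X99.0`.

```python
def is_assault_code(icd10_code):
    """Return True if an ICD-10 code falls in the X85-Y09 block."""
    code = icd10_code.strip().upper()
    # Assumed layout: one letter, then two digits (anything after is ignored)
    if len(code) < 3 or not code[1:3].isdigit():
        return False
    letter, number = code[0], int(code[1:3])
    # X85..X99 and Y00..Y09 together make up X85-Y09
    return (letter == "X" and number >= 85) or (letter == "Y" and number <= 9)


for sample in ["X85", "X99.0", "Y09", "Y10", "W20"]:
    print(sample, is_assault_code(sample))
```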

In [66]:
# Fetch the OData service document, which lists the entity sets of table 81453ENG
r = requests.get("http://opendata.cbs.nl/ODataApi/OData/81453ENG")
service = json.loads(r.text)
for entity_set in service['value']:
    print(entity_set)

{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/TableInfos', 'name': 'TableInfos'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/UntypedDataSet', 'name': 'UntypedDataSet'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/TypedDataSet', 'name': 'TypedDataSet'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/DataProperties', 'name': 'DataProperties'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/CategoryGroups', 'name': 'CategoryGroups'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/Sex', 'name': 'Sex'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/Age', 'name': 'Age'}
{'url': 'http://opendata.cbs.nl/ODataApi/OData/81453ENG/Periods', 'name': 'Periods'}
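
Since every entry in the service document is a `name`/`url` pair, it can be convenient to index the entries by name. The sketch below assumes the parsed response has the shape printed above; `index_entity_sets` is a hypothetical helper, and the sample dictionary stands in for `service` so that no network call is needed.

```python
def index_entity_sets(service_doc):
    """Map each entity-set name to its URL from a parsed OData service document."""
    return {entry["name"]: entry["url"] for entry in service_doc.get("value", [])}


# Two entries copied from the output above, standing in for the parsed response
sample = {"value": [
    {"url": "http://opendata.cbs.nl/ODataApi/OData/81453ENG/TableInfos", "name": "TableInfos"},
    {"url": "http://opendata.cbs.nl/ODataApi/OData/81453ENG/TypedDataSet", "name": "TypedDataSet"},
]}

urls = index_entity_sets(sample)
print(urls["TypedDataSet"])
```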


### Crimes, type of crime
This table contains figures on the number of registered crimes. These are broken down according to the type of crime, including figures on High Impact Crimes (theft / burglary, violent crimes, robberies, street robbery). The crime types shown are a selection of all crime types, and do not add up to the Total number of crimes.
Data available from: 2014  
https://data.overheid.nl/data/dataset/misdrijven-soort-misdrijf

In [55]:
# Fetch the OData service document for table 47005NED (registered crimes by type)
r = requests.get("https://dataderden.cbs.nl/ODataApi/OData/47005NED")
service = json.loads(r.text)
for entity_set in service['value']:
    print(entity_set)

{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/TableInfos', 'name': 'TableInfos'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/UntypedDataSet', 'name': 'UntypedDataSet'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/TypedDataSet', 'name': 'TypedDataSet'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/DataProperties', 'name': 'DataProperties'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/CategoryGroups', 'name': 'CategoryGroups'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/SoortMisdrijf', 'name': 'SoortMisdrijf'}
{'url': 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/Perioden', 'name': 'Perioden'}


In [56]:
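# Flat list that alternates URL and name for each entity set
values = []
for value in service['value']:
    values.append(value['url'])
    values.append(value['name'])

# One row per entity set, with 'name' and 'url' columns
df = pd.DataFrame(service['value'], columns=['name', 'url'])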

In [57]:
values

['https://dataderden.cbs.nl/ODataApi/OData/47005NED/TableInfos',
 'TableInfos',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/UntypedDataSet',
 'UntypedDataSet',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/TypedDataSet',
 'TypedDataSet',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/DataProperties',
 'DataProperties',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/CategoryGroups',
 'CategoryGroups',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/SoortMisdrijf',
 'SoortMisdrijf',
 'https://dataderden.cbs.nl/ODataApi/OData/47005NED/Perioden',
 'Perioden']

## Normalize

In [58]:
from pandas.io.json import json_normalize

In [67]:
df_meta = json_normalize(service['value'])

In [68]:
df_meta

| | name | url |
|---|---|---|
| 0 | TableInfos | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 1 | UntypedDataSet | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 2 | TypedDataSet | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 3 | DataProperties | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 4 | CategoryGroups | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 5 | Sex | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 6 | Age | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |
| 7 | Periods | http://opendata.cbs.nl/ODataApi/OData/81453ENG... |


In [69]:
df_meta.index = df_meta['name']

In [70]:
# URL of the TableInfos entity set (table-level metadata)
tableinfos = df_meta.loc['TableInfos', 'url']

In [71]:
tableinfos

'http://opendata.cbs.nl/ODataApi/OData/81453ENG/TableInfos'

In [72]:
r_tableinfos = requests.get(tableinfos)
json_normalize(json.loads(r_tableinfos.text)['value']).T

| | 0 |
|---|---|
| Catalog | CBS |
| DefaultPresentation | graphtype=Table&r=Age,Periods&k=Topics&t=Sex |
| DefaultSelection | $filter=((Sex eq '1100')) and ((Age eq '10000'... |
| Description | CONTENTS\r\n\r\n1. General information\r\n2. D... |
| ExplanatoryText | |
| Frequency | Yearly |
| GraphTypes | Table,Bar,Line |
| ID | 0 |
| Identifier | 81453ENG |
| Language | en |
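
The `DefaultSelection` field shows that the table is queried with an OData `$filter` expression on dimension codes such as `Sex eq '1100'`. As a sketch, a filtered URL for one of the entity sets could be assembled as follows. The helper `build_filter_url` is hypothetical, and whether the CBS endpoint accepts a particular filter has not been verified here; the code only builds the URL and does not send a request.

```python
from urllib.parse import quote

BASE = "http://opendata.cbs.nl/ODataApi/OData/81453ENG"


def build_filter_url(entity_set, conditions, base=BASE):
    """Build an OData URL with a $filter made of `field eq 'value'` clauses.

    `conditions` is a list of (field, value) tuples, joined with `and`.
    """
    clauses = ["({} eq '{}')".format(field, value) for field, value in conditions]
    filter_expr = " and ".join(clauses)
    # Percent-encode spaces but keep parentheses and quotes readable
    return "{}/{}?$filter={}".format(base, entity_set, quote(filter_expr, safe="()'"))


# Code '1100' for Sex is taken from the DefaultSelection value above
url = build_filter_url("TypedDataSet", [("Sex", "1100")])
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the unfiltered URLs used earlier.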
