# South Australia Crime Data 
## Data 
Data is courtesy of **South Australian Government Data Directory & SAPOL**. 

[CRIME DATA](https://data.sa.gov.au/data/dataset/crime-statistics) 
[SUBURB DATA](https://data.sa.gov.au/data/dataset/suburb-boundaries/resource/58b3b8ef-f292-4e27-a7bc-215ad7670cda)

## Problem
Clean, visualise and identify trends in South Australian crime data.

Background reading on data cleaning:
- https://monkeylearn.com/data-cleaning/
- https://monkeylearn.com/blog/data-cleaning-python/

### Cleaning: will be split into three categories
1. **Data exploring:** Understanding the data, noting any changes that need to be implemented, and identifying patterns, elements and relationships
2. **Data Filtering:**
3. **Data Cleaning:** 

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_crime = pd.read_csv('resources/2018-19-data_sa_crime.csv')
data = pd.DataFrame(data_crime)
data_suburb = pd.read_csv('resources/SASuburbs.csv')
suburb = pd.DataFrame(data_suburb)
data.head()


Unnamed: 0,Reported Date,Suburb - Incident,Postcode - Incident,Offence Level 1 Description,Offence Level 2 Description,Offence Level 3 Description,Offence count
0,1/07/2018,ABERFOYLE PARK,5159,OFFENCES AGAINST PROPERTY,THEFT AND RELATED OFFENCES,Theft from motor vehicle,1.0
1,1/07/2018,ADELAIDE,5000,OFFENCES AGAINST PROPERTY,PROPERTY DAMAGE AND ENVIRONMENTAL,Other property damage and environmental,1.0
2,1/07/2018,ADELAIDE,5000,OFFENCES AGAINST PROPERTY,THEFT AND RELATED OFFENCES,Other theft,5.0
3,1/07/2018,ADELAIDE,5000,OFFENCES AGAINST PROPERTY,THEFT AND RELATED OFFENCES,Receive or handle proceeds of crime,1.0
4,1/07/2018,ADELAIDE,5000,OFFENCES AGAINST PROPERTY,THEFT AND RELATED OFFENCES,Theft from motor vehicle,1.0


# Locate missing data 
Locate missing data using the pandas method `isnull()`, which flags every position in the data set where a value is missing. It returns a DataFrame of booleans in which `True` indicates a missing value. Chaining `sum()` onto the result then returns the number of missing values per column.

In [3]:
data.isnull().sum()

Reported Date                    1
Suburb - Incident              197
Postcode - Incident            373
Offence Level 1 Description      1
Offence Level 2 Description      1
Offence Level 3 Description      1
Offence count                    1
dtype: int64
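
To make the mechanics concrete, here is a minimal sketch on a toy dataframe. The values and the names `toy`, `mask`, `missing_per_column` and `rows_with_gaps` are hypothetical and only reuse two column names from the crime data. It shows why summing the boolean mask gives a count, and how the same mask can pull out the affected rows for inspection.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two gaps in one column, one gap in the other.
toy = pd.DataFrame({
    "Suburb - Incident": ["ADELAIDE", None, None],
    "Postcode - Incident": [5000, 5159, np.nan],
})

# Boolean DataFrame, True wherever a value is missing.
mask = toy.isnull()

# True counts as 1 when summed, so this is the number of gaps per column.
missing_per_column = mask.sum()

# Rows with at least one missing value, handy for eyeballing the gaps.
rows_with_gaps = toy[mask.any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```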

# Cleaning
Looking at the above table, it is evident that the key columns of this data set are:
1. Reported Date 
2. Postcode - Incident
3. Offence Level 1 Description
4. Offence count

## Replacing missing data
- Missing data within the *Suburb - Incident* column can be filled, assuming the postcode is known.

- We will not fill in missing *Postcode - Incident* values, because the same suburb name can appear under more than one postcode and one postcode can cover several suburbs, i.e. *Postcode - Incident* is an **identifier** whereas *Suburb - Incident* is an **identity**.
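
One way the suburb fill could look is sketched below. This is an illustration under assumptions, not the notebook's own implementation: the frames `crime` and `lookup`, their values, and the name `postcode_to_suburb` are hypothetical. Keeping the first suburb name per postcode is an arbitrary tie-break.

```python
import pandas as pd

# Hypothetical crime rows: the second one has a postcode but no suburb name.
crime = pd.DataFrame({
    "Suburb - Incident": ["ADELAIDE", None, "ABERFOYLE PARK"],
    "Postcode - Incident": [5000, 5000, 5159],
})

# Hypothetical lookup table in the shape of the suburbs file.
lookup = pd.DataFrame({
    "postcode": [5000, 5159, 5159],
    "suburb": ["ADELAIDE", "ABERFOYLE PARK", "CHANDLERS HILL"],
})

# One postcode can cover several suburbs, so keep the first name per postcode
# to make the mapping unambiguous (an assumption, not necessarily the best choice).
postcode_to_suburb = (
    lookup.drop_duplicates(subset="postcode").set_index("postcode")["suburb"]
)

# Look up a name for every row, then use it only where the name is missing.
inferred = crime["Postcode - Incident"].map(postcode_to_suburb)
crime["Suburb - Incident"] = crime["Suburb - Incident"].fillna(inferred)

print(crime)
```

On real data the two postcode columns would need matching dtypes before `map` can find any matches.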

## Removing data
- Any row that is missing a value in any of the four key columns listed above will be removed, as those columns are critical to this analysis


In [4]:
# drop rows that are missing a value in any of the key columns
remove = ['Postcode - Incident','Reported Date','Offence count','Offence Level 1 Description']
data.dropna(subset=remove, inplace=True)
data.isnull().sum()

Reported Date                  0
Suburb - Incident              6
Postcode - Incident            0
Offence Level 1 Description    0
Offence Level 2 Description    0
Offence Level 3 Description    0
Offence count                  0
dtype: int64

# Clean suburbs data 
The suburbs data set only needs two columns: the postcode column and the suburb name column.


In [5]:
suburb.head()

Unnamed: 0,_id,postcode,suburb,suburb_num,legalstart,shape_Leng,shape_Area
0,1,872,AMATA,87206,00000000,0.258469,0.003274
1,2,872,ANANGU PITJANTJATJARA YANKUNYTJATJARA,87205,00000000,16.039186,9.288729
2,3,872,AYERS RANGE SOUTH,87202,26/04/2013,1.466852,0.117333
3,4,872,DE ROSE HILL,87201,26/04/2013,1.685673,0.167839
4,5,872,IWANTJA,87209,00000000,0.143976,0.001144


In [9]:
remove = ['_id','suburb_num','legalstart','shape_Leng','shape_Area']
# errors='ignore' keeps this cell safe to re-run: once the columns are gone,
# a plain drop raises KeyError "[...] not found in axis"
suburb.drop(columns=remove, inplace=True, errors='ignore')
suburb.head()

In [8]:
suburb.isnull().sum()

postcode    0
suburb      0
dtype: int64

# Update Missing Data 
Here we will use postcodes to fill in the missing suburbs within our `data` dataframe.

To do this, we transform the `suburb` dataframe into a dictionary where the key is the suburb name and the value is the suburb postcode.
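
A minimal sketch of the dictionary described above, built with `dict(zip(...))` on a toy frame that reuses three of the sample rows shown earlier. The names `suburb_toy` and `suburb_to_postcode` are hypothetical, and this is only one of several ways to build such a mapping.

```python
import pandas as pd

# Toy stand-in for the cleaned suburb dataframe (postcode and suburb columns only).
suburb_toy = pd.DataFrame({
    "postcode": [872, 872, 872],
    "suburb": ["AMATA", "AYERS RANGE SOUTH", "DE ROSE HILL"],
})

# Key: suburb name, value: postcode. If a suburb name appeared twice,
# the later row would silently overwrite the earlier one.
suburb_to_postcode = dict(zip(suburb_toy["suburb"], suburb_toy["postcode"]))

print(suburb_to_postcode)
```

Because all three toy suburbs share one postcode, reversing this dictionary would collapse them into a single entry.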
