# Explore UK Crime Data with Pandas and GeoPandas


## Table of Contents

1. [Introduction to GeoPandas](#geopandas)<br>
2. [Getting ready](#ready)<br>
3. [London boroughs](#boroughs)<br>
    3.1. [Load data](#load1)<br>
    3.2. [Explore data](#explore1)<br>
4. [Crime data](#crime)<br>
    4.1. [Load data](#load2)<br>
    4.2. [Explore data](#explore2)<br>
5. [OSM data](#osm)<br>
    5.1. [Load data](#load3)<br>
    5.2. [Explore data](#explore3)<br>

In [None]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon
import matplotlib.pyplot as plt
from datetime import datetime

%matplotlib inline

<a id="geopandas"></a>
## 1. Introduction to GeoPandas

> If you have not used Pandas before, please read through this [10 minute tutorial](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) or check out this [workshop](https://github.com/IBMDeveloperUK/pandas-workshop/blob/master/README.md).

A GeoSeries or GeoDataFrame is very similar to a Pandas Series or DataFrame, but a GeoDataFrame has an additional column with the geometry. You can load a file, or create your own:

In [None]:
df = pd.DataFrame({'city':       ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'population': [9787426,  2553379,     2440986,    1777934, 1209143],
        'area':       [1737.9,   630.3,       598.9,      487.8,   368.5 ],
        'latitude':   [51.50853, 53.48095,    52.48142,   53.79648,55.86515],
        'longitude':  [-0.12574, -2.23743,    -1.89983,   -1.54785,-4.25763]})

df['geometry']  = list(zip(df.longitude, df.latitude))

df['geometry'] = df['geometry'].apply(Point)

cities = gpd.GeoDataFrame(df, geometry='geometry')
cities.head()
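
The same result can be reached in one step with `gpd.points_from_xy`. The sketch below uses two of the cities from the table above and also sets a coordinate reference system; the choice of `EPSG:4326` is an assumption that the coordinates are WGS84 longitude/latitude, and the name `cities_alt` is only used for illustration.

```python
import pandas as pd
import geopandas as gpd

df = pd.DataFrame({
    'city': ['London', 'Manchester'],
    'latitude': [51.50853, 53.48095],
    'longitude': [-0.12574, -2.23743],
})

# points_from_xy takes x (longitude) first, then y (latitude)
cities_alt = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['longitude'], df['latitude']),
    crs='EPSG:4326',  # assumption: coordinates are WGS84 longitude/latitude
)
print(cities_alt.crs)
```

Setting the `crs` at creation time avoids having to assign it later, for example before a spatial join.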

Creating a basic map is similar to creating a plot from a Pandas DataFrame:

In [None]:
cities.plot(column='population');

As `cities` is a DataFrame you can apply data manipulations, for instance:

In [None]:
cities['population'].mean()

### Points vs Lines vs Polygons

We need some more data! Create Points by squeezing out the geometry for each city:

In [None]:
lon_point = cities.loc[cities['city'] == 'London', 'geometry'].squeeze()
man_point = cities.loc[cities['city'] == 'Manchester', 'geometry'].squeeze()
birm_point = cities.loc[cities['city'] == 'Birmingham', 'geometry'].squeeze()
leeds_point = cities.loc[cities['city'] == 'Leeds', 'geometry'].squeeze()

Create a line between two cities by building a LineString from their two points:

In [None]:
lon_man_line = gpd.GeoSeries(LineString([lon_point, man_point]))
man_birm_line = gpd.GeoSeries(LineString([man_point, birm_point]))
birm_lon_line = gpd.GeoSeries(LineString([birm_point,lon_point]))
leeds_man_line = gpd.GeoSeries(LineString([leeds_point, man_point]))
birm_leeds_line = gpd.GeoSeries(LineString([birm_point,leeds_point]))

Create a polygon between three cities by building a Polygon from their three points:

In [None]:
# a Polygon needs at least three distinct corners; the ring is closed automatically if the last point differs from the first
lon_man_birm_polygon = gpd.GeoSeries(Polygon([[lon_point.x,lon_point.y],[man_point.x,man_point.y],[birm_point.x,birm_point.y],[lon_point.x,lon_point.y]]))
leeds_man_birm_polygon = gpd.GeoSeries(Polygon([[leeds_point.x,leeds_point.y],[man_point.x,man_point.y],[birm_point.x,birm_point.y]]))

And plot all of them together:

In [None]:
fig, (poly1,poly2) = plt.subplots(ncols=2, sharex=True, sharey=True)

lon_man_birm_polygon.plot(ax=poly1, color='lightblue', edgecolor='black',alpha=0.5);
lon_man_line.plot(ax=poly1,color='violet',alpha=0.5);
man_birm_line.plot(ax=poly1,color='blue',alpha=0.5);
birm_lon_line.plot(ax=poly1,color='green',alpha=0.5);

leeds_man_birm_polygon.plot(ax=poly2, color='yellow', edgecolor='black',alpha=0.5);
leeds_man_line.plot(ax=poly2,color='red',alpha=0.5);
man_birm_line.plot(ax=poly2,color='blue',alpha=0.5);
birm_leeds_line.plot(ax=poly2,color='green',alpha=0.5);

### Overlay

With `overlay` you can combine geometries. The supported operations include union, difference, symmetric difference and intersection.

Let's combine the 2 polygons:

In [None]:
poly1 = gpd.GeoDataFrame({'geometry': lon_man_birm_polygon})
poly2 = gpd.GeoDataFrame({'geometry': leeds_man_birm_polygon})

gpd.overlay( poly1, poly2, how='union').plot(color='red',alpha=0.5);
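
To see how the operations differ, here is a small sketch with two overlapping squares. The coordinates and column names are made up for illustration; only `gpd.overlay` and its `how` argument come from the text above.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# two overlapping 2x2 squares (made-up coordinates)
left = gpd.GeoDataFrame({'name': ['left']},
                        geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])
right = gpd.GeoDataFrame({'label': ['right']},
                         geometry=[Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])])

# total area of the result for each overlay operation
areas = {}
for how in ['union', 'intersection', 'difference', 'symmetric_difference']:
    result = gpd.overlay(left, right, how=how)
    areas[how] = result.area.sum()

print(areas)
```

The squares share a 1x1 corner, so the intersection has area 1, the union 7, the difference (left minus right) 3 and the symmetric difference 6.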

### Buffer

In [None]:
cities1 = cities[0:1].copy()
cities1.head()

In [None]:
base = cities1.buffer(3).plot(color='blue',alpha=0.5);
cities1.buffer(2).plot(ax=base,color='green',alpha=0.5);
cities1.buffer(1).plot(ax=base,color='yellow',alpha=0.5);
cities1.plot(ax=base,color='red',alpha=0.5);
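
Note that `buffer` works in the units of the coordinates, so the buffers above are 1, 2 and 3 degrees wide. If you want a buffer in metres, one option is to reproject first. The sketch below assumes the point is in WGS84 (`EPSG:4326`) and uses the British National Grid (`EPSG:27700`) as the projected system; both choices are assumptions, not something the notebook sets.

```python
import geopandas as gpd
from shapely.geometry import Point

london = gpd.GeoSeries([Point(-0.12574, 51.50853)], crs='EPSG:4326')

# reproject to British National Grid, whose unit is the metre
london_bng = london.to_crs(epsg=27700)

# a 1 km buffer, then back to longitude/latitude to combine with other layers
buffer_1km = london_bng.buffer(1000)
buffer_1km_wgs84 = buffer_1km.to_crs(epsg=4326)

# area in square kilometres, close to pi for a circle with a 1 km radius
print(buffer_1km.area.iloc[0] / 1e6)
```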

### Spatial relationships

There are several functions to check geospatial relationships: `equals`, `contains`, `crosses`, `disjoint`, `intersects`, `overlaps`, `touches`, `within` and `covers`. These all use `shapely`: read more [here](https://shapely.readthedocs.io/en/stable/manual.html#predicates-and-relationships) and some more background [here](https://en.wikipedia.org/wiki/Spatial_relation).

A few examples:

In [None]:
cities.head()

In [None]:
cities1.head()

In [None]:
cities1.contains(lon_point)

In [None]:
cities1[cities1.contains(lon_point)]

In [None]:
cities[cities.contains(man_point)]

The inverse of `contains`:

In [None]:
cities[cities.within(cities1)]

In [None]:
cities[cities.disjoint(lon_point)]
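
The cities are all points, which makes some of these relationships hard to see. Below is a small sketch with a square and three made-up points, using `shapely` directly, that shows how a point on the boundary is treated.

```python
from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
inside = Point(1, 1)
on_edge = Point(2, 1)
outside = Point(3, 3)

print(square.contains(inside))    # True
print(inside.within(square))      # True, the inverse of contains
print(square.contains(on_edge))   # False, a point on the boundary is not contained
print(square.touches(on_edge))    # True
print(square.covers(on_edge))     # True, covers also accepts boundary points
print(square.disjoint(outside))   # True
```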

<a id="ready"></a>
## 2. Getting ready

### 2.1. Add data to Cloud Object Store (COS)
The data for this workshop needs to be added to your project. Go to the GitHub repo and download the files in the [data folder](https://github.com/IBMDeveloperUK/python-geopandas-workshop/tree/master/data) to your machine. 

Add the following files to COS through the data menu on the right of the notebook (click the 1010 button at the top right if you do not see it):

- boundaries.zip
- 2018-1-metropolitan-street.zip
- 2018-2-metropolitan-street.zip
- 2018-metropolitan-stop-and-search.zip
- london_inner_pois.zip


### 2.2. Project Access token

As the data files are not simple csv files, we need a little trick to load the data. The first thing you need is a project access token to programmatically access COS.

Click the 3 dots at the top of the notebook to insert the project token that you created earlier. This will create a new cell in the notebook that you will need to run first before continuing with the rest of the notebook. If you are sharing this notebook you should remove this cell, otherwise anyone can use your Cloud Object Storage from this project.

> If you cannot find the new cell it is probably at the top of this notebook. Scroll up, run the cell and continue with section 2.3

### 2.3. Helper function to load data into notebook

The second thing you need to load data into the notebook is the helper function below. It copies the data to the local project space, from where it can be loaded.

In [None]:
# define the helper function 
def download_file_to_local(project_filename, local_file_destination=None, project=None):
    """
    Uses project-lib to get a bytearray and then downloads this file to local.
    Requires a valid `project` object.
    
    Args:
        project_filename (str): the filename to be passed to get_file
        local_file_destination (str): the filename for the local file if different
        project: the project object created by the inserted project token cell
        
    Returns:
        0 if everything worked
    """
    
    # get the file
    print("Attempting to get file {}".format(project_filename))
    _bytes = project.get_file(project_filename).read()
    
    # check for new file name, download the file
    print("Downloading...")
    if local_file_destination is None:
        local_file_destination = project_filename
    
    with open(local_file_destination, 'wb') as f: 
        f.write(bytearray(_bytes))
        print("Completed writing to {}".format(local_file_destination))
        
    return 0
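
If you want to try the helper outside a project, a stand-in object is enough. The sketch below is only an illustration: `FakeProject` is a hypothetical class that mimics the one call the helper relies on (`get_file(...)` returning something with `.read()`), and the helper is repeated in a condensed form without the print statements so the example runs on its own.

```python
import io
import os
import tempfile


class FakeProject:
    """Hypothetical stand-in for the project object; only get_file is mimicked."""

    def __init__(self, files):
        self._files = files

    def get_file(self, name):
        # return a file-like object, which is what the helper calls .read() on
        return io.BytesIO(self._files[name])


def download_file_to_local(project_filename, local_file_destination=None, project=None):
    # condensed copy of the helper above, without the print statements
    _bytes = project.get_file(project_filename).read()
    if local_file_destination is None:
        local_file_destination = project_filename
    with open(local_file_destination, 'wb') as f:
        f.write(bytearray(_bytes))
    return 0


fake = FakeProject({'example.csv': b'a,b\n1,2\n'})
target = os.path.join(tempfile.mkdtemp(), 'example.csv')
status = download_file_to_local('example.csv', target, project=fake)

# read the local copy back to check that the bytes arrived unchanged
with open(target, 'rb') as f:
    content = f.read()
print(status, content)
```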

<a id="boroughs"></a>
## 3. London boroughs

<a id="load1"></a>
### 3.1. Load data

Geospatial data comes in many formats, but with GeoPandas you can read most files with just one command. For example, this GeoJSON file with the London boroughs:

In [None]:
# load data from a url
boroughs = gpd.read_file("https://skgrange.github.io/www/data/london_boroughs.json")
boroughs.head()
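
A GeoJSON file is a collection of features, each with a `geometry` and a set of `properties` that become the columns. As a rough sketch of that structure, the hand-written collection below (one made-up square with invented property values) is turned into a GeoDataFrame with `GeoDataFrame.from_features`, without reading any file.

```python
import geopandas as gpd

# a hand-written GeoJSON FeatureCollection with one made-up square
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Example borough", "code": "X0000"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
        }
    ],
}

# each property becomes a column, the geometry becomes the geometry column
example = gpd.GeoDataFrame.from_features(collection["features"], crs="EPSG:4326")
print(example.columns.tolist())
```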

<a id="explore1"></a>
### 3.2. Explore data

To plot a basic map add `.plot()` to a GeoDataFrame.

In [None]:
boroughs.plot();

In [None]:
boroughs.plot(column='code');

In [None]:
boroughs.plot(column='area_hectares');

### Dissolve

The boroughs are made up of many districts that you might want to combine. In this example you can do that by adding a new column with the same value for every row and then calling `.dissolve()` on that column:

In [None]:
boroughs['all'] = 1
allboroughs = boroughs.dissolve(by='all',aggfunc='sum')
allboroughs.head()

In [None]:
allboroughs.plot();
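
`dissolve` merges the geometries of all rows that share a value in the `by` column and aggregates the other columns with `aggfunc`. A minimal sketch with three made-up squares and invented population numbers:

```python
import geopandas as gpd
from shapely.geometry import box

districts = gpd.GeoDataFrame(
    {
        'region': ['A', 'A', 'B'],
        'population': [100, 150, 80],
    },
    # the two 'A' squares share an edge, the 'B' square stands alone
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(3, 0, 4, 1)],
)

# one row per region: geometries are merged, population is summed
regions = districts.dissolve(by='region', aggfunc='sum')
print(regions)
```

The two squares of region `A` become a single 2x1 rectangle with a population of 250, and `region` becomes the index of the result.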

To change the size of the map and remove the box around the map, run the below:

In [None]:
[fig, ax] = plt.subplots(1, figsize=(10, 6))
allboroughs.plot(ax=ax);
ax.axis('off');

### Join

Let's join this with some more data: 

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/london-borough-profiles.csv',encoding = 'unicode_escape')

In [None]:
df.head()

The columns to join the two tables on are `code` and `Code`. To use the join method, first the index of both tables has to be set to this column.

The below adds the columns from `df` to `boroughs`:


In [None]:
boroughs = boroughs.set_index('code').join(df.set_index('Code'))
boroughs.head()
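
An alternative that avoids changing the index is `merge` with `left_on` and `right_on`. The sketch below compares both on two small made-up tables; the column names `code` and `Code` mirror the ones above, everything else is invented.

```python
import pandas as pd

left = pd.DataFrame({'code': ['E1', 'E2', 'E3'], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'Code': ['E1', 'E2'], 'population': [1000, 2000]})

# index-based join, as in the cell above
joined = left.set_index('code').join(right.set_index('Code'))

# column-based merge that keeps every row of `left`
merged = left.merge(right, how='left', left_on='code', right_on='Code')

print(joined)
print(merged)
```

Both keep the row for `E3`, with a missing value for `population`. The merged table keeps both key columns, while the joined table uses `code` as its index.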

In [None]:
boroughs2 = boroughs.dissolve(by='Inner/_Outer_London',aggfunc='mean')

[fig, ax] = plt.subplots(1, figsize=(10, 6))
boroughs2.plot(column='id', cmap='Paired', linewidth=0.5, edgecolor='black', legend=False, ax=ax);
ax.axis('off');

Below is a map of the average gender pay gap for each borough. 

* add a new column `paygap`
* define the size of the plot
* plot the background 
* add the paygap data and a title

In [None]:
boroughs['paygap'] =((boroughs['Gross_Annual_Pay_-_Male_(2016)'] - boroughs['Gross_Annual_Pay_-_Female_(2016)'])/ \
    boroughs['Gross_Annual_Pay_-_Male_(2016)']) * 100

[fig,ax] = plt.subplots(1, figsize=(12, 8))

boroughs.plot(ax=ax, color="lightgrey", edgecolor='black', linewidth=0.5)

boroughs.dropna().plot(column='paygap', cmap='Reds', edgecolor='black', linewidth=0.5,
               legend=True, ax=ax);
ax.axis('off');
ax.set_title('Gender pay gap in London (2016)');
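
Written as a formula, the `paygap` column computed above is the difference in pay as a percentage of male pay:

$$\text{paygap} = \frac{\text{pay}_{\text{male}} - \text{pay}_{\text{female}}}{\text{pay}_{\text{male}}} \times 100$$

As an illustration with invented figures, a male gross annual pay of 40,000 and a female gross annual pay of 34,000 give $(40000 - 34000) / 40000 \times 100 = 15$, a pay gap of 15%.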

<a id="crime"></a>
## 4. Crime data

The crime data is pre-processed in this [notebook](https://github.com/IBMDeveloperUK/geopandas-workshop/blob/master/notebooks/prepare-uk-crime-data.ipynb) so it is easier to read here. We will only look at data from 2018.

Data is downloaded from https://data.police.uk/ ([License](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/))

<a id="load2"></a>
### 4.1. Load data

This dataset cannot be loaded into a GeoDataFrame directly. Instead the data is loaded into a DataFrame and then converted:

In [None]:
download_file_to_local('2018-1-metropolitan-street.zip', project=project)
download_file_to_local('2018-2-metropolitan-street.zip', project=project)
street = pd.read_csv("./2018-1-metropolitan-street.zip")
street2 = pd.read_csv("./2018-2-metropolitan-street.zip")
street = pd.concat([street, street2])  # DataFrame.append was removed in pandas 2.0

In [None]:
download_file_to_local('2018-metropolitan-stop-and-search.zip', project=project)
stop_search = pd.read_csv("./2018-metropolitan-stop-and-search.zip")

Clean up of the local directory:

In [None]:
! rm *.zip

In [None]:
street.head()

In [None]:
stop_search.head()

#### Convert to GeoDataFrames

In [None]:
street['coordinates'] = list(zip(street.Longitude, street.Latitude))
street['coordinates'] = street['coordinates'].apply(Point)
street = gpd.GeoDataFrame(street, geometry='coordinates')
street.head()

In [None]:
stop_search['coordinates'] = list(zip(stop_search.Longitude, stop_search.Latitude))
stop_search['coordinates'] = stop_search['coordinates'].apply(Point)
stop_search = gpd.GeoDataFrame(stop_search, geometry='coordinates')
stop_search.head()

<a id="explore2"></a>
### 4.2. Explore data


<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 Explore the data with Pandas. There are no right or wrong answers, the questions below give you some suggestions of what to look at. <br/> 
   <ul>
  <li>How much data is there? Is this changing over time? Can you plot this? </li>
  <li>Are there missing values? Should these rows be deleted?  </li>
  <li>Which columns of the datasets contain useful information? What kind of categories are there and are they all meaningful?</li>
  <li>Which crimes occur most often? And near which location?</li>
  <li>Is there anything you want to explore further or are curious about? Is there any data that you will need for this?</li>      
  <li>Notice anything odd about the latitude and longitudes? Read here how the data is anonymised: https://data.police.uk/about/.</li>       
  </ul> 
    
  Uncomment and run the cells starting with '# %load' to see some of the things that we came up with. Run each cell twice, once to load the code and then again to run the code.  
</div>  

In [None]:
# your data exploration (add as many cells as you need by clicking the `+` at the top of the notebook)


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer1.py
print ('rows in street: '+str(len(street)))

# columns 
print ('Columns: '+str(street.columns))

# categories
print ('Crime type: '+str(street['Crime type'].unique()))
print ('Last outcome category: '+str(street['Last outcome category'].unique()))
print (street['Context'].unique())

# delete columns
street = street.drop(columns=['Unnamed: 0','Latitude', 'Longitude','Context'])

# convert Date to datetime
street['Month'] = street['Month'].apply(lambda x: datetime.strptime(x, "%Y-%m"))

street.head()


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer2.py
bystreet = street.groupby(['Location','Crime type']).count()
bystreet = bystreet.drop(columns=['Month', 'Last outcome category','coordinates','LSOA code'])
bystreet = bystreet.rename(index=str, columns={"Crime ID": "Number of crimes"})

bystreet.sort_values(by=['Number of crimes'], ascending=False).head()


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer3.py

In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer4.py

In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer5.py

Some things we noticed:
* The number of stop and searches seems to go up. That is something you could investigate further. Are any of the categories increasing? 
* Another interesting question is how the object of search and the outcome are related. Are there types of searches where nothing is found more frequently? 
* In the original files there are also columns of gender, age range and ethnicity. If you want to explore this further you can change the code and re-process the data from this [notebook](https://github.com/IBMDeveloperUK/geopandas-workshop/blob/master/notebooks/prepare-uk-crime-data.ipynb) and use the full dataset.
* And how could you combine the two datasets?

### Spatial join

> The below solution was found [here](https://gis.stackexchange.com/questions/306674/geopandas-spatial-join-and-count) after googling for 'geopandas count points in polygon'

The coordinate system (`crs`) needs to be the same for both GeoDataFrames. 

In [None]:
print(boroughs.crs)
print(stop_search.crs)

Match each point to a borough with a spatial join. With `boroughs` on the left, the result has one row for each stop and search, carrying the `geometry` and other columns of its borough together with the columns from `stop_search`. 

In [None]:
stop_search.crs = boroughs.crs
dfsjoin = gpd.sjoin(boroughs,stop_search) 
dfsjoin.head()
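
To make the shape of the result easier to see, here is a small sketch with two made-up rectangles and three made-up points. It relies on the default behaviour of `gpd.sjoin` (an inner join on intersecting geometries); the names and coordinates are invented.

```python
import geopandas as gpd
from shapely.geometry import Point, box

areas = gpd.GeoDataFrame(
    {'name': ['west', 'east']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs='EPSG:4326',
)
points = gpd.GeoDataFrame(
    {'event': ['a', 'b', 'c']},
    geometry=[Point(0.2, 0.5), Point(0.7, 0.1), Point(1.5, 0.5)],
    crs='EPSG:4326',
)

# polygons on the left: one output row per (polygon, matching point) pair
joined = gpd.sjoin(areas, points)

# count the matched points per polygon
counts = joined.groupby('name').size()
print(counts)
```

Two points fall in `west` and one in `east`, so the joined table has three rows that all carry a polygon geometry.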

Then aggregate this table with a [pivot table](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html) that counts, for each borough, the number of searches in each `Object of search` category. Then drop the extra column level and reset the index, so you can merge this new table back into the `boroughs` GeoDataFrame.

In [None]:
dfpivot = pd.pivot_table(dfsjoin,index='id',columns='Object of search',aggfunc={'Object of search':'count'})
dfpivot.columns = dfpivot.columns.droplevel()
dfpivot = dfpivot.reset_index()
dfpivot.head()
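
A possible alternative for this counting step is `pd.crosstab`, which returns a flat table directly and fills combinations that do not occur with 0 instead of a missing value. The sketch below uses a made-up table with the same two column names.

```python
import pandas as pd

searches = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'Object of search': ['Controlled drugs', 'Controlled drugs', 'Stolen goods',
                         'Stolen goods', 'Stolen goods'],
})

# rows are boroughs (id), columns are the categories, values are counts
counts = pd.crosstab(searches['id'], searches['Object of search']).reset_index()
counts.columns.name = None  # drop the leftover label of the column axis
print(counts)
```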

In [None]:
boroughs3 = boroughs.merge(dfpivot, how='left',on='id')
boroughs3.head()

Let's make some maps!

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(20,5))

p1=boroughs3.plot(column='Controlled drugs',ax=axs[0],cmap='Blues',legend=True);
axs[0].set_title('Controlled drugs', fontdict={'fontsize': '12', 'fontweight' : '5'});

p2=boroughs3.plot(column='Stolen goods',ax=axs[1], cmap='Reds',legend=True);
axs[1].set_title('Stolen goods', fontdict={'fontsize': '12', 'fontweight' : '5'});


<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 Explore the data with GeoPandas. Again there are no right or wrong answers, the questions below give you some suggestions of what to look at. <br/> 
   <ul>
  <li>Improve the above maps. How many arrests are there in each borough? Use the above method but first select only the arrests using the column 'Outcome'. Can you plot this? </li>
  <li>Are there changes over time? Is there a difference between months? Use `street` and look at Westminster or another borough where the crime rate seems higher. </li>    
  </ul> 
</div>  

In [None]:
# your data exploration (add as many cells as you need)


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer6.py

In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-geopandas-workshop/master/answers/answer7.py

<a id="osm"></a>
## 5. OSM data

The OpenStreetMap data is also pre-processed in this [notebook]() so it is easier to read into this notebook. 

Data is downloaded from http://download.geofabrik.de/europe/great-britain.html and a more detailed description of the data is [here](http://download.geofabrik.de/osm-data-in-gis-formats-free.pdf).

<a id="load3"></a>
### 5.1. Load data

In [None]:
download_file_to_local('london_inner_pois.zip', project=project)
pois = gpd.read_file("zip://./london_inner_pois.zip")
pois.head()

<a id="explore3"></a>
### 5.2. Explore data

In [None]:
pois.size

In [None]:
pois['fclass'].unique()

Count and plot the number of pubs by borough:

In [None]:
pubs = pois[pois['fclass']=='pub']

pubs2 = gpd.sjoin(boroughs,pubs) 
pubs3 = pd.pivot_table(pubs2,index='id',columns='fclass',aggfunc={'fclass':'count'})
pubs3.columns = pubs3.columns.droplevel()
pubs3 = pubs3.reset_index()
boroughs5 = boroughs.merge(pubs3, left_on='id',right_on='id')

boroughs5.plot(column='pub',cmap='Blues',legend=True);

<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 Explore the data further. Again there are no right or wrong answers, the questions below give you some suggestions of what to look at. <br/> 
   <ul>
  <li> Can you find a category of POIs that relates to the number of crimes? You might have to aggregate the data on a different, more detailed level for this one. </li>
  <li> Count the number of crimes around a certain POI. Choose a point and use the buffer function from the top of the notebook. But note that the crimes are anonymised, so the exact location is not given, only an approximation.  </li>
       
  </ul> 
</div>  
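
As a starting point for the last question, the sketch below counts points within a given distance of a single location. It is only an outline: the coordinates are made up and assumed to be in metres already (for the real data you would first reproject to a CRS with metre units), and the 500 m radius is an arbitrary choice.

```python
import geopandas as gpd
from shapely.geometry import Point

# made-up coordinates, assumed to be in metres in a projected CRS
poi = Point(0, 0)
crimes = gpd.GeoSeries([Point(100, 0), Point(0, 300), Point(400, 400), Point(900, 0)])

# circle with a 500 m radius around the point of interest
search_area = poi.buffer(500)

# keep only the crimes that fall inside the circle
nearby = crimes[crimes.within(search_area)]
print(len(nearby))
```

Here two of the four points lie within 500 m; the point at (400, 400) is about 566 m away and falls outside.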

In [None]:
# answers


Hopefully you now have an idea of the possibilities of geospatial data. There is a lot more to explore with this data. Let us know if you find anything interesting! We are on Twitter as @MargrietGr and @yaminigrao

#### Author 

Margriet Groenendijk is a Data & AI Developer Advocate for IBM. She develops and presents talks and workshops about data science and AI. She is active in the local developer communities through attending, presenting and organising meetups and conferences. She has a background in climate science where she explored large observational datasets of carbon uptake by forests during her PhD, and global scale weather and climate models as a postdoctoral fellow. 




Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.