In [None]:
""" plotting with help """
# source: https://medium.com/@sukantkhurana/analyzing-geospatial-data-with-geopandas-and-plotly-b13dedcbe466

In [None]:
import pandas as pd
import geopandas as gpd

In [None]:
# dataset needed

"""
Analyzing geospatial data with GeoPandas and plotly
Sukant Khurana

Feb 23, 2019·8 min read

by Siddharth Dutta

(The article is written entirely by my student Siddharth, as part of an assignment to learn about geospatial data plotting in Python. I have not edited a word, so all praise and criticism are his. I found this work worthy of sharing, so I am putting it on my blog with due credit to him.)

GeoPandas

The goal of GeoPandas is to make working with geospatial data in python easier. It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries to shapely. GeoPandas enables you to easily do operations in python that would otherwise require a spatial database such as PostGIS.

In this article we are going to look at using GeoPandas on a geospatial dataset about all the countries of the world.

Before we get to the visuals, let’s talk about shape files. A .shp file is actually just one of the three files that are necessary for your shape data to render. You may have noticed that a .shp file comes in a zip file when you download it; this is because you also need a .dbf and a .shx file. While the .shp file gives features their geometry, the .dbf stores attribute data and object ids, and the .shx provides an index that is helpful with certain software and packages (such as pyshp and AutoCAD).
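
A quick way to check that all three parts are present before loading is sketched below. It uses only the standard library; the helper name missing_shapefile_parts and the example path are made up for illustration.

```python
from pathlib import Path

# The three file extensions named in the text above.
REQUIRED_PARTS = (".shp", ".dbf", ".shx")


def missing_shapefile_parts(shp_path):
    # Return the required extensions that have no file next to shp_path.
    base = Path(shp_path)
    return [ext for ext in REQUIRED_PARTS if not base.with_suffix(ext).exists()]


if __name__ == "__main__":
    # "data/countries.shp" is a hypothetical path used only for illustration.
    print(missing_shapefile_parts("data/countries.shp"))
```

An empty list means the .shp, .dbf and .shx files all sit side by side.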

GeoPandas (and shapely for the individual objects) provides a whole lot of basic methods to analyse geospatial data (distance, length, centroid, boundary, convex_hull, simplify, transform, ...).

GeoDataFrame is a data frame which has a ‘geometry’ column. The .geometry attribute returns a GeoSeries (the column name itself is not necessarily ‘geometry’).

GeoSeries is a Series that holds (shapely) geometry objects (Points, LineStrings, Polygons, …).
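
As a small illustration of these two objects, the sketch below builds a GeoDataFrame from two made-up geometries (a square and a point, not part of any dataset in this article) and calls a few of the element-wise methods listed above.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two made-up features: a 2 x 2 square and a single point.
gdf = gpd.GeoDataFrame(
    {"name": ["square", "dot"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]), Point(5, 5)],
)

geoms = gdf.geometry        # a GeoSeries of shapely objects
areas = geoms.area          # element-wise area, in the units of the coordinates
centroids = geoms.centroid  # element-wise centroid, again a GeoSeries

# The active geometry column can carry any name; .geometry still finds it.
renamed = gdf.rename_geometry("shape")
active_name = renamed.geometry.name
```

After the rename the column is called 'shape', yet renamed.geometry returns the same GeoSeries.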

Plotly

The plotly Python package is an open-source library built on plotly.js, which in turn is built on d3.js.

With plotly we can create graphs that are more interactive than those of matplotlib (hovering, zooming and panning are built in), which makes it easier to explore the data and check what we observe.

GIS Spatial Data Types

There are two broad categories of geospatial data:

1. Raster data: It is made up of pixels (also referred to as grid cells). They are usually evenly spaced and square. Raster data often look pixelated because each pixel has its own value or class, e.g. satellite images.

2. Vector data: Vector data is split into three types: polygon, line (or arc) and point data. Polygons are used to represent areas such as the boundary of a city (on a large scale map), lake, or forest. Polygon features are two dimensional and therefore can be used to measure the area and perimeter of a geographic feature. Polygon features are most commonly distinguished using either a thematic mapping symbol (color schemes), patterns, or in the case of numeric gradation, a color gradation scheme could be used.
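
To make the difference concrete, here is a minimal pure-Python sketch with made-up data: a raster as a grid of cell values, and a vector polygon as a ring of vertices from which area and perimeter can be measured. The function names are illustrative and not part of GeoPandas.

```python
import math

# Raster: a grid of cells, each holding one value (here a made-up land-cover class).
raster = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]
cells_of_class_1 = sum(row.count(1) for row in raster)

# Vector: a polygon stored as an ordered ring of (x, y) vertices.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]


def polygon_area(ring):
    # Shoelace formula; the ring is closed implicitly (last vertex joins the first).
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(ring):
    # Sum of the edge lengths, including the closing edge.
    return sum(math.dist(p, q) for p, q in zip(ring, ring[1:] + ring[:1]))
```

In practice shapely computes the same quantities through the area and length attributes of a Polygon.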

Visualization in Spatial Data Analysis

Here we use our dataset to describe a common type of geo-visualization used for area unit data with numeric attributes, namely choropleth maps. Choropleth maps play a prominent role in spatial data science. The word choropleth stems from the root “choro” meaning “region”. As such choropleth maps are appropriate for areal unit data where each observation combines a value of an attribute and a polygon. Choropleth maps derive from an earlier era where cartographers faced technological constraints that precluded the use of unclassed maps where each unique attribute value could be represented by a distinct symbol. Instead, attribute values were grouped into a smaller number of classes with each class being associated with a unique symbol that was in turn applied to all polygons with attribute values falling in the class.

The effectiveness of a choropleth map will be a function of the choice of classification scheme together with the color or symbolization strategy adopted. In broad terms, the classification scheme defines the number of classes as well as the rules for assignment, while the symbolization should convey information about the value differentiation across the classes.
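
How much the classification scheme matters can be seen in a small pure-Python sketch. It compares two common schemes, equal intervals and quantiles, on made-up skewed values; the helper names are illustrative only.

```python
import bisect
import statistics


def equal_interval_breaks(values, k):
    # Upper bounds of k classes of equal width between min and max.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    # Upper bounds of k classes holding roughly equal numbers of observations.
    return statistics.quantiles(values, n=k, method="inclusive") + [max(values)]


def classify(values, breaks):
    # Index of the first class whose upper bound is >= the value.
    return [min(bisect.bisect_left(breaks, v), len(breaks) - 1) for v in values]


# Made-up, right-skewed attribute values for eight polygons.
prices = [20, 22, 25, 27, 30, 35, 60, 120]
by_interval = classify(prices, equal_interval_breaks(prices, 4))
by_quantile = classify(prices, quantile_breaks(prices, 4))
```

With equal intervals six of the eight polygons fall into the lowest class, while quantiles put two polygons in each class, so the same data would produce two very different maps.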

Let's load our dataset:

import pandas as pd

import geopandas as gpd

import seaborn as sbn

df = gpd.read_file('data/berlin-districts.geojson')

We can use seaborn to visualize the statistical distribution of the median price of listings across districts:

sbn.distplot(df['median_price'])

The distplot combines a histogram with a kernel density. Both reflect a right-skewed distribution, not uncommon for housing rents, or urban income distributions.

sbn.distplot(df['median_price'], rug=True)

Adding the rug argument provides additional insight as to the distribution of specific observations across the price range. The histogram and density give us a sense of the "value" distribution. From a spatial data science perspective, we are also interested in the "spatial" distribution of these values.

Since we have a GeoDataFrame we can call the plot method to generate a default choropleth:
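
A minimal sketch of such a call is given here. It assumes a median_price column as above, and uses three made-up square "districts" as a stand-in for the districts file so that it runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import box

# Stand-in for the districts file: three made-up square "districts".
districts = gpd.GeoDataFrame(
    {"median_price": [35.0, 50.0, 80.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

fig, ax = plt.subplots(figsize=(6, 3))
# column= selects the attribute that drives the fill colour of each polygon.
districts.plot(column="median_price", cmap="viridis", legend=True, ax=ax)
ax.set_axis_off()
```

With the real data, the equivalent call would be df.plot(column='median_price').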

Alternatively, we can plot a choropleth map using plotly as follows:

(NOTE: the data contains information about public education spending of the various states of America.)

import pandas as pd

import plotly.offline as pyo  # plotly's offline plotting module

df = pd.read_csv('states.csv')

# 'Blues' stands in for the 'custom-colorscale' placeholder, which is not a built-in plotly scale;
# any named scale or a custom list of colors can be used here
data = [dict(type='choropleth', autocolorscale=False,
             locations=df['code'], z=df['dollars'], locationmode='USA-states',
             colorscale='Blues', colorbar=dict(title='thousand dollars'))]

layout = dict(title='state spending on public education', geo=dict(scope='usa', projection=dict(type='albers usa'), showlakes=True, lakecolor='rgb(66,165,245)'))

fig = dict(data=data, layout=layout)

pyo.plot(fig)

Kernel Regressions

Kernel regressions are one exceptionally common way to allow observations to “borrow strength” from nearby observations.

However, when working with spatial data, there are two simultaneous senses of what is near:

· Things that are similar in attribute (classical kernel regression)

· Things that are similar in spatial position (spatial kernel regression)
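
The two senses of "near" can be illustrated with a small Gaussian kernel (Nadaraya-Watson) estimator in pure Python. This is a sketch on four made-up listings, not the scikit-learn code used in this article; the function name and the bandwidth value are assumptions for illustration.

```python
import math


def gaussian_kernel_predict(train_pos, train_y, query, bandwidth):
    # Nadaraya-Watson estimate: a weighted mean of train_y, where the weight
    # of each observation decays with its distance from the query position.
    weights = [
        math.exp(-0.5 * (math.dist(pos, query) / bandwidth) ** 2)
        for pos in train_pos
    ]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


# Four made-up listings: (bedrooms,) as the attribute, (x, y) as the location.
attributes = [(1.0,), (1.0,), (3.0,), (3.0,)]
locations = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.0, 5.1)]
prices = [50.0, 90.0, 70.0, 130.0]

# "Near" in attribute space: the other one-bedroom listings dominate.
by_attribute = gaussian_kernel_predict(attributes, prices, (1.0,), bandwidth=0.5)
# "Near" in geographic space: the listings around (0, 0) dominate.
by_location = gaussian_kernel_predict(locations, prices, (0.0, 0.0), bandwidth=0.5)
```

The same query listing gets a prediction of roughly 70 when strength is borrowed from similar listings and roughly 60 when it is borrowed from nearby ones.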

Below, we’ll walk through how to use scikit-learn to fit these two types of kernel regressions, show how it’s not super simple to mix the two approaches together, and refer to an approach that does this correctly in another package.

import numpy as np

# listings is assumed to be a GeoDataFrame of point listings loaded beforehand; it is not defined in this article
model_data = listings[['accommodates', 'review_scores_rating', 'bedrooms', 'bathrooms', 'beds', 'price', 'geometry']].dropna()

Xnames = ['accommodates', 'review_scores_rating', 'bedrooms', 'bathrooms', 'beds']

X = model_data[Xnames].values

X = X.astype(float)

y = np.log(model_data[['price']].values)

# one (x, y) row per point geometry
coordinates = np.vstack(model_data.geometry.apply(lambda p: np.hstack(p.xy)).values)

scikit-learn's neighbor regressions are contained in the sklearn.neighbors module, and there are two main types:

· KNeighborsRegressor, which uses a k-nearest neighborhood of observations around each focal site

· RadiusNeighborsRegressor, which considers all observations within a fixed radius around each focal site.

Further, these methods can use inverse distance weighting to set the relative importance of sites around each focal site; in this way, near things are given more weight than far things, even when there are a lot of near things.
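
The idea behind inverse distance weighting on a k-nearest neighborhood can be sketched in a few lines of pure Python. This is a simplified illustration on made-up sites, not scikit-learn's implementation; the function name is hypothetical.

```python
import math


def idw_knn_predict(train_pos, train_y, query, k):
    # Keep the k training points closest to the query.
    nearest = sorted(
        (math.dist(pos, query), y) for pos, y in zip(train_pos, train_y)
    )[:k]
    # An exact match would mean dividing by zero, so average the exact matches instead.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    # Inverse distance weights: a neighbor twice as far counts half as much.
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Made-up sites and values; the far-away outlier is excluded by k=3.
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
values = [10.0, 20.0, 40.0, 1000.0]
pred = idw_knn_predict(sites, values, (0.0, 1.0), k=3)
```

Here the two sites at distance 1 count more than the one at distance 1.41, so the prediction sits slightly above the plain average of the three neighbors.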

import sklearn.neighbors as skn

import sklearn.metrics as skm

shuffle = np.random.permutation(len(y))

train, test = shuffle[:14000], shuffle[14000:]

So, let’s fit three models:

· spatial: using inverse distance weighting on the nearest 500 neighbors in geographical space

· attribute: using inverse distance weighting on the nearest 500 neighbors in attribute space

· both: using inverse distance weighting in both geographical and attribute space.

KNNR = skn.KNeighborsRegressor(weights='distance', n_neighbors=500)

spatial = KNNR.fit(coordinates[train,:], y[train,:])

KNNR = skn.KNeighborsRegressor(weights='distance', n_neighbors=500)

attribute = KNNR.fit(X[train,:], y[train,:])

KNNR = skn.KNeighborsRegressor(weights='distance', n_neighbors=500)

both = KNNR.fit(np.hstack((coordinates,X))[train,:], y[train,:])

To score them, I’m going to grab their out-of-sample predictions and get their % explained variance:

sp_ypred = spatial.predict(coordinates[test,:])

att_ypred = attribute.predict(X[test,:])

# columns are stacked here as (X, coordinates), whereas the model above was fitted on (coordinates, X)
both_ypred = both.predict(np.hstack((X,coordinates))[test,:])

(skm.explained_variance_score(y[test,], sp_ypred),

skm.explained_variance_score(y[test,], att_ypred),

skm.explained_variance_score(y[test,], both_ypred))
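
For reference, the explained variance score is 1 - Var(y - ŷ) / Var(y). The pure-Python sketch below, on made-up numbers, shows the two landmark values; the function name is illustrative and the sketch handles a single output only.

```python
import statistics


def explained_variance(y_true, y_pred):
    # 1 - Var(residuals) / Var(y_true), using population variances.
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - statistics.pvariance(residuals) / statistics.pvariance(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
# Perfect predictions leave no residual variance.
perfect = explained_variance(y_true, y_true)
# Predicting the mean everywhere leaves residuals that vary as much as y itself.
mean_only = explained_variance(y_true, [2.5, 2.5, 2.5, 2.5])
```

A score of 1 means the predictions account for all the variation in y, while a score near 0 means the model does no better than predicting a single constant value.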

result->> (0.1443088606590084, 0.3149860849884514, -5.684468673550214e-09)

Application

Now we are going to use GeoPandas to visualize data on road accidents in Auckland. The location of the road data is stored in the local variable path.

import geopandas as gpd

import matplotlib.pyplot as plt

import pandas as pd

roads = gpd.read_file(str(path))

f, ax = plt.subplots(1, figsize=(12, 12))

ax = roads.plot(column='type', ax=ax)

plt.show()

Above we have obtained the road map of Auckland. To help viewers draw better insights from the plot, we can scatter plot the accident locations on it. The dataset of road accidents is stored in the local variable path1.

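crashes = pd.read_csv(path1)

f = crashes.copy()

cond = f['LG_REGION_DESC'].str.contains('Auckland')

cond &= (f['EASTING'] > 0) & (f['NORTHING'] > 0)

f = f[cond].copy()

# NZTM and geometry are not defined in the article; the two lines below are assumed definitions
NZTM = 'EPSG:2193'  # New Zealand Transverse Mercator 2000

geometry = gpd.points_from_xy(f['EASTING'], f['NORTHING'])

crashes = gpd.GeoDataFrame(f, crs=NZTM, geometry=geometry)

f, ax = plt.subplots(1, figsize=(12, 12))

base = roads.plot(color='black', ax=ax)

crashes.plot(ax=base, marker='o', color='red', markersize=3)

plt.show()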

We have discussed only a few applications of GeoPandas in this article. It can also be used in combination with other packages like shapely, fiona and plotly for advanced data visualization. The key to data visualization is to know under what circumstances a specific package is suited.

When should we use GeoPandas?

    For exploratory data analysis, including in Jupyter notebooks.
    For highly compact and readable code, which in turn improves reproducibility.
    If we are comfortable with Pandas, R dataframes, or tabular/relational approaches.

When may it not be the best tool?

    For polished map creation and multi-layer, interactive visualization; if we are comfortable with GIS software, one option is to use a desktop GIS like QGIS. We can generate intermediate GIS files and plots with GeoPandas, then shift over to QGIS. Or refine the plots in Python with matplotlib or additional packages. GeoPandas can help us manage and pre-process the data, and do initial visualizations.
    If we need very high performance. Performance has been increasing and substantial enhancements are in the works (including possibly a parallelization implementation).

Advantages of using plotly:

1. We can create interactive plots with plotly which help us to draw better insights from the given data.

2. After we learn the basic idea of the library, it allows us to rapidly build stunning visualizations.

3. Once we get accustomed to the basics of the library, it is very user friendly in nature.

4. Plotly uses declarative programming, which means less effort spent building up a figure, allowing us to focus on what to present and how to interpret it.

In this article, we explored some features and applications of GeoPandas. To explore more properties of this amazing Python package, we can visit: http://geopandas.org

“Introduction to Geospatial Data Analysis with Python” is also a great visual reference to start with:

https://www.youtube.com/watch?v=kJXUUO5M4ok

References:

The following tutorial is a great example-driven way to get started with GeoPandas:

https://github.com/mrcagney/introducing_geopandas

To explore more properties of Plotly, we can visit: https://plot.ly/python

"""