# Introduction to geospatial machine learning
This notebook covers the basics of common geospatial datatypes and how to use GeoPandas to understand and manipulate them.

### What is GeoPandas?
GeoPandas is an open-source library that enables the use and manipulation of geospatial data in Python. It extends the common datatypes used in pandas with GeoSeries and GeoDataFrame, which support a wide range of geometric operations. GeoPandas is also built on top of shapely for its geometric operations; this underlying implementation makes GeoPandas fast and appropriate for many machine learning pipelines that require large geospatial datasets.

### Geospatial concepts

#### A. Geospatial common datatypes
Two common geospatial file formats that you need to be familiar with are Shapefile (.shp) and GeoJSON (.geojson).
Shapefile is a vector data format that is developed and maintained mostly by a company called ESRI. It stores feature geometry (points, lines, and polygons) together with attribute information; note that it does not store topology.

GeoJSON is a JSON-based format that stores geometry information (geometry type, coordinates, etc.) alongside the typical attributes of each object (index, name, etc.).
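To make the GeoJSON structure concrete, here is a minimal sketch of a single Feature parsed with Python's standard `json` module. The point, its coordinates, and the `name` property are made-up values for illustration; they are not part of the dataset used in this notebook.

```python
import json

# A hypothetical GeoJSON Feature: one geometry plus free-form attributes.
feature_text = """
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-73.97, 40.78]},
  "properties": {"name": "Example point"}
}
"""

feature = json.loads(feature_text)

# GeoJSON stores coordinates in [longitude, latitude] order.
lon, lat = feature["geometry"]["coordinates"]
print(feature["geometry"]["type"], lon, lat)
print(feature["properties"]["name"])
```

The `geometry` member is what ends up in the geometry column, while the entries under `properties` become ordinary columns.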

Once you load either of these data formats using GeoPandas, the library creates a GeoDataFrame: a DataFrame with an additional geometry column.

This is how you import the sample geodata bundled with the GeoPandas library (the New York City borough boundaries, "nybb"), which we are going to use in this and subsequent posts.

In [1]:
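import geopandas

# Path to the bundled New York City boroughs sample dataset ("nybb").
# Note: geopandas.datasets was deprecated in GeoPandas 0.14 and removed in 1.0.
# On newer versions, install the geodatasets package and use
# geodatasets.get_path("nybb") instead.
path_to_data = geopandas.datasets.get_path("nybb")
gdf = geopandas.read_file(path_to_data)

gdf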

| | BoroCode | BoroName | Shape_Leng | Shape_Area | geometry |
|---|---|---|---|---|---|
| 0 | 5 | Staten Island | 330470.010332 | 1623820000.0 | MULTIPOLYGON (((970217.022 145643.332, 970227.... |
| 1 | 4 | Queens | 896344.047763 | 3045213000.0 | MULTIPOLYGON (((1029606.077 156073.814, 102957... |
| 2 | 3 | Brooklyn | 741080.523166 | 1937479000.0 | MULTIPOLYGON (((1021176.479 151374.797, 102100... |
| 3 | 1 | Manhattan | 359299.096471 | 636471500.0 | MULTIPOLYGON (((981219.056 188655.316, 980940.... |
| 4 | 2 | Bronx | 464392.991824 | 1186925000.0 | MULTIPOLYGON (((1012821.806 229228.265, 101278... |


### Introduction to basic geometry attributes
Now that we have some idea of what geospatial data looks like and how to load our very first dataset using GeoPandas, let's try some basic attributes and methods to further cement our understanding.

In [2]:
gdf = gdf.set_index("BoroName")

#### Area
From the geometry column, we can measure the area of each geometry, provided it is of type POLYGON or MULTIPOLYGON (lines and points have no area). The result is expressed in the squared units of the dataset's coordinate reference system.

In [4]:
gdf["area"] = gdf.area
gdf["area"]

BoroName
Staten Island    1.623822e+09
Queens           3.045214e+09
Brooklyn         1.937478e+09
Manhattan        6.364712e+08
Bronx            1.186926e+09
Name: area, dtype: float64
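To build some intuition for what an area computation does on a simple polygon, here is a minimal pure-Python sketch of the shoelace formula. It is only an illustration: `shoelace_area` is a hypothetical helper, not the routine GeoPandas uses internally (that work is delegated to shapely), and it assumes a single ring without holes.

```python
def shoelace_area(ring):
    """Return the area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A 2 x 2 square, so the area should be 4 square units.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(shoelace_area(square))  # 4.0
```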

#### Polygon Boundary
Since our geometries are of type polygon or multipolygon, we can extract their boundary lines. This can be useful when, say, we want to measure the perimeter of the polygons.

In [5]:
gdf["boundary"] = gdf.boundary
gdf["boundary"]

BoroName
Staten Island    MULTILINESTRING ((970217.022 145643.332, 97022...
Queens           MULTILINESTRING ((1029606.077 156073.814, 1029...
Brooklyn         MULTILINESTRING ((1021176.479 151374.797, 1021...
Manhattan        MULTILINESTRING ((981219.056 188655.316, 98094...
Bronx            MULTILINESTRING ((1012821.806 229228.265, 1012...
Name: boundary, dtype: geometry
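As a sketch of the perimeter idea mentioned above, the length of a boundary ring is simply the sum of its edge lengths. The helper below is hypothetical and written in plain Python for illustration; in GeoPandas itself, the `length` attribute of a GeoSeries gives the corresponding value for each geometry.

```python
import math


def ring_length(ring):
    """Return the total edge length of a closed ring of (x, y) vertices."""
    n = len(ring)
    # Pair each vertex with the next one, wrapping around to close the ring.
    return sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n))


# A 4 x 3 rectangle, so the perimeter should be 14 units.
rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(ring_length(rectangle))  # 14.0
```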

#### Centroid
If you want to find the centroid of each polygon, you can use the centroid attribute as follows.

In [6]:
gdf["centroid"] = gdf.centroid
gdf["centroid"]

BoroName
Staten Island     POINT (941639.450 150931.991)
Queens           POINT (1034578.078 197116.604)
Brooklyn          POINT (998769.115 174169.761)
Manhattan         POINT (993336.965 222451.437)
Bronx            POINT (1021174.790 249937.980)
Name: centroid, dtype: geometry
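For intuition, the centroid of a simple polygon is an area-weighted average, not just the mean of its vertices. The sketch below applies the standard polygon centroid formula to a small L-shaped ring; `polygon_centroid` is a hypothetical helper for illustration only and assumes a single ring without holes.

```python
def polygon_centroid(ring):
    """Return the (x, y) centroid of a simple polygon given as (x, y) vertices."""
    n = len(ring)
    signed_area = 0.0
    cx = cy = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x1 * y2 - x2 * y1
        signed_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    signed_area /= 2.0
    return cx / (6.0 * signed_area), cy / (6.0 * signed_area)


# An L-shaped polygon made of a 2 x 1 base and a 1 x 1 block on its left side.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(polygon_centroid(l_shape))
```

Both coordinates come out at roughly 0.83, whereas the plain average of the six vertices would be (1.0, 1.0).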

#### Distance
Now that we know the positions of the centroids, suppose we want to find the distance between Queens and every other borough. This can be done easily using the distance() method:

In [7]:
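# Select the Queens centroid by label (the index was set to BoroName earlier)
queens_centroid = gdf["centroid"].loc["Queens"]
# Distance from every borough centroid to the Queens centroid
gdf["distance2queens"] = gdf["centroid"].distance(queens_centroid)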
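For two points, this distance is the ordinary straight-line (Euclidean) distance in the coordinate system of the data. As a rough cross-check, the sketch below recomputes it in plain Python from the centroid coordinates printed in the previous section; the values are copied from that output and are therefore rounded to three decimals.

```python
import math

# Centroid coordinates copied from the centroid output (rounded to 3 decimals).
centroids = {
    "Staten Island": (941639.450, 150931.991),
    "Queens": (1034578.078, 197116.604),
    "Brooklyn": (998769.115, 174169.761),
    "Manhattan": (993336.965, 222451.437),
    "Bronx": (1021174.790, 249937.980),
}

qx, qy = centroids["Queens"]

# Straight-line distance from each centroid to the Queens centroid.
distances = {
    name: math.hypot(x - qx, y - qy) for name, (x, y) in centroids.items()
}

for name, d in distances.items():
    print(f"{name}: {d:.1f}")
```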

You can then apply aggregate functions to find the mean, maximum, or minimum distance. Note that the minimum is 0 because it is the distance from Queens to itself.

In [9]:
print("The average distance to Queens is : ", gdf['distance2queens'].mean())
print("The maximum distance to Queens is : ", gdf['distance2queens'].max())
print("The minimum distance to Queens is : ", gdf['distance2queens'].min())

The average distance to Queens is :  49841.727323648636
The maximum distance to Queens is :  103781.53527578666
The minimum distance to Queens is :  0.0
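These numbers are expressed in the units of the dataset's coordinate reference system. Assuming the nybb data uses EPSG:2263 (NAD83 / New York Long Island), whose unit is the US survey foot, a small conversion makes the results easier to interpret. The helper names below are hypothetical, and it is worth checking `gdf.crs` to confirm the units before relying on this sketch.

```python
# One US survey foot is defined as exactly 1200/3937 meters.
US_SURVEY_FOOT_IN_M = 1200 / 3937


def feet_to_km(feet):
    """Convert a length in US survey feet to kilometers."""
    return feet * US_SURVEY_FOOT_IN_M / 1000


def square_feet_to_km2(square_feet):
    """Convert an area in square US survey feet to square kilometers."""
    return square_feet * US_SURVEY_FOOT_IN_M ** 2 / 1_000_000


# Maximum centroid distance to Queens, taken from the output above.
print(round(feet_to_km(103781.53527578666), 2), "km")
# Manhattan's area, taken from the area output earlier in the notebook.
print(round(square_feet_to_km2(6.364712e8), 2), "km^2")
```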
