# Introduction to GeoData formats
## Shapefile
A Shapefile has three required files:
* .shp - geometry
* .dbf - attributes
* .shx - index  

There are also three optional files:
* .prj - projection / coordinate system
* .sbn/.sbx - spatial index
* .xml - metadata
<div>
<img src="./img/shapefile.jpg" width="250"/>
</div>
Picture of the possible data files in a Shapefile.  
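
The file lists above can be turned into a quick sanity check before loading a Shapefile. The following is only a minimal sketch: the function name `shapefile_components` is hypothetical, and the metadata `.xml` file is left out because its exact file name is not specified here. The check is case-sensitive on case-sensitive file systems.

```python
from pathlib import Path

# Extensions taken from the lists above (metadata .xml intentionally omitted).
REQUIRED = (".shp", ".dbf", ".shx")
OPTIONAL = (".prj", ".sbn", ".sbx")


def shapefile_components(shp_path):
    """Return (present, missing_required) extensions for a Shapefile.

    Hypothetical helper: it only looks for sibling files that share the
    base name of `shp_path` and differ in their extension.
    """
    shp_path = Path(shp_path)
    present = [ext for ext in REQUIRED + OPTIONAL
               if shp_path.with_suffix(ext).exists()]
    missing = [ext for ext in REQUIRED if ext not in present]
    return present, missing
```

Calling `shapefile_components("./data/aachen/StatistischeBezirkeAachen.shp")` should return an empty second list if all three required files are in place.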

The shape types in a Shapefile are:
* points
* lines
* polygons
* and multiples of them

Every geometric feature has a geometry and attribute information.

A Shapefile is binary, which means you cannot open it in a text editor. In Python, you can read a Shapefile with the GeoPandas package. Once the data is in a GeoDataFrame, you can work with it just like a parsed GeoJSON, for example by slicing the table to extract information.

In [None]:
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
bezirke = gpd.read_file("./data/aachen/StatistischeBezirkeAachen.shp")
bezirke.head()

Shapefiles store simple feature geometries, but the format has limitations:
* NULL values are not saved
* Numbers are rounded
* Unicode strings are insufficiently supported
* Field names are limited to 10 characters
* Date fields cannot store a time
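
The 10-character limit on field names is easy to run into with descriptive column names. The sketch below shows one possible way to shorten names up front while keeping them unique. The function `truncate_field_names` and its numbering scheme are assumptions made for illustration; a real Shapefile writer may shorten names differently.

```python
def truncate_field_names(names, limit=10):
    """Shorten names to `limit` characters and keep them unique.

    Hypothetical helper: on a collision, the name is cut further and a
    numeric suffix such as "_1" is appended.
    """
    result = []
    seen = set()
    for name in names:
        short = name[:limit]
        counter = 1
        while short in seen:
            suffix = f"_{counter}"
            short = name[:limit - len(suffix)] + suffix
            counter += 1
        seen.add(short)
        result.append(short)
    return result


# Made-up column names for demonstration only.
columns = ["id", "population_total", "population_density"]
print(truncate_field_names(columns))  # ['id', 'population', 'populati_1']
```

Renaming the columns yourself before saving keeps the mapping between long and short names under your control.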

## GeoJSON
GeoJSON is, after all, a JSON file, so it can be read like one. Geospatial Python packages such as GeoPandas can read GeoJSON as well.  
On top of plain JSON, GeoJSON defines the following geometry types:  
* Point
* LineString
* Polygon
* MultiPoint
* MultiLineString
* MultiPolygon

In addition, Feature objects attach further properties to a geometry.  
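
Because GeoJSON is plain JSON, Python's built-in `json` module is enough to inspect it. The following sketch parses a small, made-up FeatureCollection (the point reuses coordinates from this tutorial's data, the line is invented) and lists the geometry types it contains. Coordinates in GeoJSON are stored as longitude first, then latitude.

```python
import json

# A small, made-up GeoJSON document for illustration.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [6.0789, 50.76305]},
      "properties": {"knotennr": 4}
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [[6.0789, 50.76305], [6.10303, 50.75383]]
      },
      "properties": {"name": "example line"}
    }
  ]
}
"""

data = json.loads(geojson_text)

# Each Feature carries a geometry and a dictionary of properties.
geometry_types = [feature["geometry"]["type"] for feature in data["features"]]
print(geometry_types)  # ['Point', 'LineString']

first = data["features"][0]
lon, lat = first["geometry"]["coordinates"]  # longitude first, then latitude
print(lon, lat, first["properties"])
```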

<div>
<img src="./img/geojson_bsp.jpg" width="500"/>
</div>
As the following example shows, files are sometimes not created correctly and need to be fixed. If the following points are used as they are, they end up in the ocean.

In [1]:
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily

# Read the GeoJSON file and reproject it to WGS 84 (EPSG:4326)
new_geoj = gpd.read_file('data/knotenpunkte-wald_ac.geojson')
new_geoj = new_geoj.to_crs(epsg=4326)
new_geoj.head()

Unnamed: 0,id,knotennr,geometry
0,knotenpunkte.1,4,POINT (6.07890 50.76305)
1,knotenpunkte.2,19,POINT (6.10303 50.75383)
2,knotenpunkte.3,14,POINT (6.07006 50.75377)
3,knotenpunkte.4,16,POINT (6.06168 50.74572)
4,knotenpunkte.5,15,POINT (6.06156 50.74399)


Let us have a look at a GeoJSON file and a GeoDataFrame, and at how the information from the GeoJSON is stored in the GeoDataFrame. Make sure to try it yourself: open a GeoJSON file and look at its structure, then compare it to the table you get when reading the file with GeoPandas.

<div>
<img src="./img/geojson_dataframe.png" width="1000"/>
</div>

The information is now stored in a GeoDataFrame. You can now slice the table to get specific information (for example, the coordinates). To show how slicing works, we select the **geometry** column of the **knotenpunkte-wald_ac.geojson** data. The **head()** method displays the first rows of the result.

In [None]:
slic_po = new_geoj['geometry']
slic_po.head()

We can also perform other operations on the GeoDataFrame or on the sliced data. In the following code cell, we use a **for loop** to iterate over the sliced data, printing every single element.

In [None]:
for elem in new_geoj['geometry']:
    print(elem)

A plot of these points does not show much, because there is no reference to other geometric shapes and no map in the background.

In [None]:
tab = new_geoj.plot(figsize=(6, 6))
plt.show()

The built-in **type()** function is helpful if you want to know what kind of data you are working with.

In [None]:
type(new_geoj.geometry[0])

Next, we colour the points, each of which represents a node, depending on their number (**knotennr**). That does not help us locate the points: they are *just* coloured now and still lack a visual reference.

In [None]:
# Colour the points by their node number
ax = (
    new_geoj.set_geometry('geometry')
    .plot('knotennr', markersize=20)
)

In [None]:
type(ax)

### EPSG
The **to_crs()** method takes an EPSG code. **EPSG** stands for **E**uropean **P**etroleum **S**urvey **G**roup, a former working group of the oil and gas industry. An EPSG code is a globally defined key number of 4 to 5 digits that identifies a coordinate reference system, for example EPSG:4326 (WGS 84) or EPSG:3857 (Web Mercator).

In [None]:
# Reproject to EPSG:3857 (Web Mercator)
new_geoj = new_geoj.to_crs(epsg=3857)
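
Reprojecting from EPSG:4326 to EPSG:3857 turns coordinates in degrees into coordinates in metres. To make that concrete, here is a minimal sketch of the spherical Web Mercator formula behind EPSG:3857. The function name `wgs84_to_web_mercator` is hypothetical, and the code is for illustration only; GeoPandas performs the transformation itself when **to_crs()** is called.

```python
import math

# WGS 84 semi-major axis in metres, used as the sphere radius in EPSG:3857.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Convert EPSG:4326 degrees to EPSG:3857 metres (spherical formula)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# First point of the node table shown earlier (longitude, latitude).
x, y = wgs84_to_web_mercator(6.07890, 50.76305)
print(round(x), round(y))
```

The resulting values are in the hundreds of thousands or millions, which is a quick way to tell projected metres from degrees when inspecting a **geometry** column.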

When creating plots, it helps to add a basemap, because the geographic context makes it easier to locate the points. We just changed the EPSG code, which is important here: *contextily* fetches its maps from web tile providers, which usually serve tiles in EPSG:3857, and each provider may require a particular EPSG code.  

In [None]:
ax = new_geoj.plot(
    column='id', 
    legend=True, 
    edgecolor='none', 
    figsize=(12, 12)
)
# Add a basemap to give the points a geographic context
contextily.add_basemap(
    ax,
    crs=new_geoj.crs.to_string(), 
    source=contextily.providers.CartoDB.Positron
)

## GeoPackage
The following picture shows the mandatory and optional tables of a GeoPackage. They can be split into two categories:
* Metadata tables
* User-defined data tables

There are two mandatory metadata tables (the purple coloured ones in the picture):

* contents (`gpkg_contents`)
* spatial ref sys (`gpkg_spatial_ref_sys`)  

<div>
<img src="./img/geopackage-overview.png" width="750"/>
</div> 
Picture of the possible tables of GeoPackage (from https://www.geopackage.org/spec120/).  
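
A GeoPackage is an SQLite database, so its tables can be listed with Python's built-in `sqlite3` module. The sketch below checks for the two mandatory metadata tables, which the GeoPackage specification names `gpkg_contents` and `gpkg_spatial_ref_sys`. The helper names `list_tables` and `missing_mandatory_tables` are hypothetical.

```python
import sqlite3
from contextlib import closing

# Mandatory metadata tables as named in the GeoPackage specification.
MANDATORY_TABLES = {"gpkg_contents", "gpkg_spatial_ref_sys"}


def list_tables(path):
    """Return the names of all tables in an SQLite database file."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    return {name for (name,) in rows}


def missing_mandatory_tables(path):
    """Return the mandatory GeoPackage tables that `path` does not contain."""
    return MANDATORY_TABLES - list_tables(path)
```

For a valid file, `missing_mandatory_tables("./data/aachen_network.gpkg")` should return an empty set. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path before calling it.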

GeoPackage is binary, which means you cannot open it in a text editor. To work with spatial data such as GeoPackages, we need geospatial Python packages such as GeoPandas. First we can analyse the data by checking the index, the columns, and the shape, and we can plot the table. The 'aachen_network' file contains a node network and an edge network, so make sure to select the layer you want; by default, the node layer is read.

In [None]:
import geopandas as gpd
import contextily
# Load the GeoPackage file, specifying the layer
file_gb = gpd.read_file("./data/aachen_network.gpkg", layer="nodes")
file_gb.columns

To get information about the GeoDataFrame's shape, use the **.shape** attribute. The output is a tuple: the first number is the number of rows, the second is the number of columns.

In [None]:
file_gb.shape

To check whether the two values are correct, we display the GeoDataFrame. The eight column headers *osmid, y, x, street_count, lon, lat, highway, geometry* represent the eight columns. Printing all the rows means printing more than 2755 lines. Remember that counting in Python starts at 0.

In [None]:
file_gb.head(2760)

Let us plot the geoinformation using the **.plot()** method. The result is not a map that humans can easily read and interpret, but it does place the points in a 2D graph.

In [None]:
file_gb.plot(markersize=0.1)

In [None]:
# Reproject to EPSG:3857 and display the first rows
file_gb = file_gb.to_crs(epsg=3857)
file_gb.head()

In [None]:
ax = file_gb.plot(figsize=(10, 10), alpha=0.5, edgecolor='k')
contextily.add_basemap(
    ax,
    crs=file_gb.crs.to_string(), 
    source=contextily.providers.CartoDB.Positron
)