<div align="center"><img src="../images/LKYCIC_Header.jpg"></div>

**Table of contents**<a id='toc0_'></a>    
- [2-01: Vector Data Analysis](#toc1_)    
  - [Vector](#toc1_1_)    
    - [Point](#toc1_1_1_)    
      - [Creating single point](#toc1_1_1_1_)    
      - [Creating a point collection](#toc1_1_1_2_)    
    - [Read Local Files as GeoDataFrame](#toc1_1_2_)    
      - [Read point data from file (ESRI Shapefile)](#toc1_1_2_1_)    
  - [Line](#toc1_2_)    
    - [Read line data from file (GeoJSON)](#toc1_2_1_)    
    - [Create line from point (From MRT station to MRT line)](#toc1_2_2_)    
  - [Polygon](#toc1_3_)    
    - [Read polygon data from file (ESRI shapefile)](#toc1_3_1_)    
  - [Join Extra tabular data to the GeoDataFrame](#toc1_4_)    
  - [Spatial Queries](#toc1_5_)    
    - [Point Query](#toc1_5_1_)    
    - [Polygon Query](#toc1_5_2_)    
  - [Next Section](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[2-01: Vector Data Analysis](#toc0_)

Learn to read **vector data** from formats like **GeoJSON, Shapefile, and KML**. 

This section covers basic spatial analysis with **GeoDataFrames**, including **spatial queries, joins**, and **plotting**. 

You'll also calculate key spatial attributes for **points, lines**, and **polygons**, and explore methods for analysing **point density, intersections**, and **buffers**.

## <a id='toc1_1_'></a>[Vector](#toc0_)

<div align="center">
    <img src="../images/points-lines-polygons.png">
    <br>Source: https://michaelminn.net/tutorials/python-areas/
</div>

Step 0: Import the libraries

In [None]:
%pip install geopandas

In [None]:
import geopandas as gpd
import pandas as pd

### <a id='toc1_1_1_'></a>[Point](#toc0_)

#### <a id='toc1_1_1_1_'></a>[Creating single point](#toc0_)

The `Point` class from the `shapely.geometry` module is used to create point geometries.

In [None]:
from shapely.geometry import Point

The **geometry** column holds the shape, and therefore the geographic location, of each row in a GeoDataFrame.

If the shapes are stored in a column with a different name, pass that name through the `geometry` argument:

In [None]:
pt = {'col1': ['name1'], 'coordinate': [Point(1, 2)]}
pt_gdf = gpd.GeoDataFrame(pt, geometry='coordinate')
pt_gdf

In [None]:
pt_gdf.plot()

In [None]:
pt = {'col1': ['name1'], 'geometry': [Point(1, 2)]}
pt_gdf = gpd.GeoDataFrame(pt)
pt_gdf
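
Tabular data often arrives with separate longitude and latitude columns rather than ready-made `Point` objects. A minimal sketch of that case, assuming hypothetical column names `lon` and `lat` and made-up values, uses `gpd.points_from_xy` to build the geometry column:

```python
import geopandas as gpd
import pandas as pd

# Hypothetical table with coordinates stored as plain numbers
df = pd.DataFrame({"col1": ["name1", "name2"], "lon": [1.0, 2.0], "lat": [2.0, 1.0]})

# points_from_xy takes x (longitude) first, then y (latitude)
xy_gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["lon"], df["lat"]))

print(xy_gdf)
```

The original `lon` and `lat` columns are kept alongside the new geometry column, so they can be dropped afterwards if they are no longer needed.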

#### <a id='toc1_1_1_2_'></a>[Creating a point collection](#toc0_)

Task: Add another point (Point(2, 1)) to the point collection

*Because a GeoDataFrame is an extension of the pandas DataFrame, we can use the same pandas function, `pd.concat()`, to append rows to it.*

In [None]:
# Add another point (Point(2, 1)) to the point collection pt_gdf
pts_gdf = pd.concat([pt_gdf, gpd.GeoDataFrame({'col1': ['name2'], 'geometry': [Point(2, 1)]})])
pts_gdf
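
One detail worth checking after `pd.concat()`: each input keeps its own row labels, so both rows end up with index `0`. A small sketch, reusing the same toy points under new hypothetical variable names, shows how `ignore_index=True` renumbers the rows:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

first = gpd.GeoDataFrame({"col1": ["name1"], "geometry": [Point(1, 2)]})
second = gpd.GeoDataFrame({"col1": ["name2"], "geometry": [Point(2, 1)]})

stacked = pd.concat([first, second])                        # row labels are 0, 0
renumbered = pd.concat([first, second], ignore_index=True)  # row labels are 0, 1

print(list(stacked.index), list(renumbered.index))
```

Duplicate labels are harmless for plotting, but they can cause surprises when selecting rows with `.loc`.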

Plot the points with the `plot()` function:

In [None]:
pts_gdf.plot()

### <a id='toc1_1_2_'></a>[Read Local Files as GeoDataFrame](#toc0_)

```python
file_path = "path_to_file/your_geospatial"
file_data = gpd.read_file(file_path)
```

See the official documentation: [Read files](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html)

GeoPandas can read most common types of vector data:

1. **Shapefile**, developed by **Esri**.

2. **GeoJSON**: A lightweight format based on JSON, often used for web mapping and data exchange.

3. **KML (Keyhole Markup Language)**: Commonly used with **Google Earth** for visualising geographic data.

4. **GPKG (GeoPackage)**: A modern, open standard format that supports both vector and raster data.

#### <a id='toc1_1_2_1_'></a>[Read point data from file (ESRI Shapefile)](#toc0_)

Here we read the file of metro (MRT/LRT) stations in Singapore:

In [None]:
data_path = '../data/raw/part_ii/'
data_name = 'mrt_sg/MRT_LRT_Stations.shp'
full_path = data_path + data_name
full_path
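
String concatenation works here because `data_path` ends with a slash. As an alternative sketch (not what this notebook uses), `pathlib.Path` joins the pieces without relying on that trailing slash; `gpd.read_file` accepts such a path object as well. The variable names below are hypothetical:

```python
from pathlib import Path

data_dir = Path("../data/raw/part_ii")  # no trailing slash needed
station_file = data_dir / "mrt_sg" / "MRT_LRT_Stations.shp"

print(station_file.name)    # file name only
print(station_file.suffix)  # extension, useful for checking the format
```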

In [None]:
metro_sg = gpd.read_file(full_path)

In [None]:
type(metro_sg)

In [None]:
metro_sg.head()

In [None]:
metro_sg.plot()

## <a id='toc1_2_'></a>[Line](#toc0_)

In [None]:
from shapely.geometry import LineString

A line can be seen as an ordered series of points.

Create a line from the coordinates of two points:

In [None]:
line = {'col1': ['name1'], 'geometry': [LineString([(1, 2), (2, 1)])]}

line_gdf = gpd.GeoDataFrame(line)

line_gdf

In [None]:
line_gdf.plot()

### <a id='toc1_2_1_'></a>[Read line data from file (GeoJSON)](#toc0_)

We will import the cycling path network of Singapore.

Data source: https://data.gov.sg/collections/359/view

In [None]:
data_name = 'cycling_path_network.geojson'
full_path = data_path + data_name
full_path

In [None]:
cycle_sg = gpd.read_file(full_path)

In [None]:
type(cycle_sg)

In [None]:
cycle_sg.plot()

Line geometries have additional attributes to explore, for example the length of each cycling path:

In [None]:
# calculate the length of each line in the cycle_sg GeoDataFrame
cycle_sg['length'] = cycle_sg.length

In [None]:
cycle_sg.head()

GeoPandas raises **a warning** here because the data is in a geographic coordinate reference system.

Lengths calculated in this CRS are in degrees, which are hard to interpret and do not correspond to a fixed ground distance.

Therefore, we need to reproject the data to a projected coordinate reference system first.
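
To get a feel for the scale of the problem, the sketch below estimates how many kilometres one degree of longitude spans near Singapore, using the haversine formula on a spherical Earth. The radius (6371 km) and the latitude (1.35° N) are assumed round values for illustration, not parameters taken from the data:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two lon/lat points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Two points one degree of longitude apart at the same latitude
one_degree_km = haversine_km(103.0, 1.35, 104.0, 1.35)
print(round(one_degree_km, 1))
```

A length of 0.01 reported in degrees therefore corresponds to roughly a kilometre here, and the conversion factor changes with latitude.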

**Geographic and Projected** Coordinate Reference Systems (CRS)

It is important to work with **the correct CRS**.

<div align="center">
    <img src="../images/gcs_pcs.png">
    <br>Source: https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
</div>

With a **projected CRS**, locations and distances are expressed in linear units, typically **metres**.

*Note:* Different projected CRSs distort shapes, areas, and distances in different ways. You can compare them in the [Interactive Album of Map Projections 2.0 (psu.edu)](https://projections.mgis.psu.edu/), and see how a projection distorts real sizes at [How big the world actually is: The true size](https://www.thetruesize.com/).

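An EPSG code is a unique ID linked to a specific coordinate reference system.

Some common coordinate systems and their EPSG codes:

- EPSG: 4326 (WGS 84, geographic, longitude/latitude in degrees)
- EPSG: 3857 (WGS 84 / Pseudo-Mercator, used by most web maps, in metres)
- EPSG: 7789 (ITRF2014, geocentric)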

Recommended EPSG codes for Singapore:

[Coordinate reference systems for "Singapore" (epsg.io)](https://epsg.io/?q=Singapore)
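
A quick way to confirm what a reprojection does is to convert a single known point. The sketch below assumes a point at 1°22' N, 103°50' E, which is the projection origin in the EPSG:3414 (SVY21) definition, and checks whether each CRS is geographic or projected. The variable names are hypothetical:

```python
import geopandas as gpd
from shapely.geometry import Point

# Longitude first, then latitude (x, y order)
origin_wgs84 = gpd.GeoSeries([Point(103 + 50 / 60, 1 + 22 / 60)], crs="EPSG:4326")
origin_svy21 = origin_wgs84.to_crs("EPSG:3414")

print(origin_wgs84.crs.is_geographic)  # True when coordinates are angles
print(origin_svy21.crs.is_projected)   # True when coordinates are linear units
print(origin_svy21.iloc[0])            # easting/northing of the converted point
```

The projected coordinates should come out close to the false easting and northing of SVY21 (about 28 002 m and 38 745 m), which is a handy sanity check that longitude and latitude were not swapped.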

In [None]:
# reproject the cycle_sg GeoDataFrame to EPSG:3414
cycle_sg = cycle_sg.to_crs("EPSG:3414")

In [None]:
# calculate the length of each line in the cycle_sg GeoDataFrame
cycle_sg['length'] = cycle_sg.length
cycle_sg.head()

In [None]:
# colour the cycle_sg GeoDataFrame by the length of each line
cycle_sg.plot(column='length', legend=True)

### <a id='toc1_2_2_'></a>[Create line from point (From MRT station to MRT line)](#toc0_)

In [None]:
metro_sg = gpd.read_file(data_path + 'mrt_sg/MRT_LRT_Stations_seqed.shp')

metro_sg.plot()
metro_sg.head()

In [None]:
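# Convert 'mrt_sequen' to numeric values so that sorting is numeric, not alphabetical
# (the Shapefile format truncates field names to 10 characters, hence 'mrt_sequen')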
metro_sg['mrt_sequen'] = pd.to_numeric(metro_sg['mrt_sequen'], errors='coerce')

metro_sg = metro_sg.sort_values(['mrt_line', 'mrt_sequen'])

metro_sg

In [None]:
#import LineString from shapely.geometry
from shapely.geometry import LineString

In [None]:
# Group by MRT line
pd.DataFrame(metro_sg.groupby('mrt_line'))

In [None]:
for mrt_line, group in metro_sg.groupby('mrt_line'):
    # Sort points by sequence
    group_sorted = group.sort_values(by='mrt_sequen')

    list_of_points = group_sorted.geometry.tolist()

    print("Metro stations of Each Line:", list_of_points)

In [None]:
for mrt_line, group in metro_sg.groupby('mrt_line'):
    # Sort points by sequence
    group_sorted = group.sort_values(by='mrt_sequen')

    list_of_points = group_sorted.geometry.tolist()

    line = LineString(list_of_points)

    print("Line:", line)

In [None]:
from pprint import pprint

lines = []

for mrt_line, group in metro_sg.groupby('mrt_line'):
    # Sort points by sequence
    group_sorted = group.sort_values(by='mrt_sequen')
    # Get the list of points
    list_of_points = group_sorted.geometry.tolist()

    line = LineString(list_of_points) # Create a LineString from the points

    lines.append({'mrt_line': mrt_line, 'geometry': line})

pprint(lines)

In [None]:
lines = []

for mrt_line, group in metro_sg.groupby('mrt_line'):
    
    group_sorted = group.sort_values(by='mrt_sequen')

    list_of_points = group_sorted.geometry.tolist()

    line = LineString(list_of_points)
    # Append the LineString with its MRT line
    lines.append({'mrt_line': mrt_line, 'geometry': line})

metro_lines = gpd.GeoDataFrame(lines, crs = metro_sg.crs)

metro_lines

In [None]:
metro_lines.plot()
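
The loop above assumes every MRT line has at least two stations; a `LineString` cannot be built from a single point. A compact sketch on made-up stations (the line names `A`, `B`, `C` and all coordinates are hypothetical) sorts once, then skips groups that are too short:

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Toy stations, deliberately out of order; line "C" has only one station
stations = gpd.GeoDataFrame(
    {
        "mrt_line": ["A", "A", "A", "B", "B", "C"],
        "mrt_sequen": [2, 1, 3, 1, 2, 1],
        "geometry": [Point(1, 0), Point(0, 0), Point(2, 0),
                     Point(0, 1), Point(0, 2), Point(5, 5)],
    },
    crs="EPSG:3414",
)

records = []
ordered = stations.sort_values(["mrt_line", "mrt_sequen"])
for name, group in ordered.groupby("mrt_line"):
    if len(group) < 2:
        continue  # a single station cannot form a line
    records.append({"mrt_line": name, "geometry": LineString(group.geometry.tolist())})

toy_lines = gpd.GeoDataFrame(records, crs=stations.crs)
print(toy_lines)
```

Because `groupby` preserves the row order within each group, sorting once before grouping is enough.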

In [None]:
metro_lines.to_file('../data/processed/part_ii/mrt_sg/metro_lines.shp')

## <a id='toc1_3_'></a>[Polygon](#toc0_)

In [None]:
from shapely.geometry import Polygon

In [None]:
d = {'col1': ['name1'], 'geometry': [Polygon([(1, 2), (2, 1), (2, 2)])]}

gdf = gpd.GeoDataFrame(d, crs="EPSG:4326")

gdf

In [None]:
gdf.plot()

### <a id='toc1_3_1_'></a>[Read polygon data from file (ESRI shapefile)](#toc0_)

In [None]:
data_name = 'planningarea_sg/sg_planning_area_nosea.shp'
full_path = data_path + data_name
full_path

In [None]:
planningarea_sg = gpd.read_file(full_path)

In [None]:
type(planningarea_sg)

In [None]:
planningarea_sg.head()

In [None]:
planningarea_sg.plot()

`Challenge 1`: Can you give each planning area a different colour?

Hint: We coloured the cycling paths by length. The syntax is similar.

In [None]:
#————————————————————————————————————————————————#

#————————————————————————————————————————————————#

## <a id='toc1_4_'></a>[Join Extra tabular data to the GeoDataFrame](#toc0_)

`Challenge 2`: Read the CSV file of income by planning area in Singapore

The file path is `../data/raw/part_ii/income_sg/income.csv`

In [None]:
#————————————————————————————————————————————————#

#————————————————————————————————————————————————#

In [None]:
planningarea_sg.head(2)

Join the attribute tables on their **common column**.

The common field is `PLN_AREA_N` or `Name`.

In [None]:
merge = pd.merge(planningarea_sg, income, on='PLN_AREA_N', how='left')

merge.head()
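
A left join keeps every planning area even when no income row matches, so mismatched names fail silently as missing values. A minimal sketch with hypothetical tables (the `income_demo` figures and its `Name` column are invented for illustration) shows how `left_on`/`right_on` handle differently named key columns and how `indicator=True` flags unmatched rows:

```python
import pandas as pd

areas = pd.DataFrame({"PLN_AREA_N": ["TAMPINES", "BEDOK", "CHANGI"]})
income_demo = pd.DataFrame({"Name": ["TAMPINES", "Bedok"], "income": [1, 2]})

# Normalise case so 'Bedok' matches 'BEDOK'
income_demo["Name"] = income_demo["Name"].str.upper()

joined = areas.merge(income_demo, left_on="PLN_AREA_N", right_on="Name",
                     how="left", indicator=True)

# Rows present only in the left table had no match in the income table
unmatched = joined.loc[joined["_merge"] == "left_only", "PLN_AREA_N"].tolist()
print(unmatched)
```

Checking the unmatched list before exporting helps catch spelling or capitalisation differences between the two sources.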

You can export the GeoDataFrame as ESRI Shapefile for future use:

In [None]:
# export the merge GeoDataFrame to a shapefile
merge.to_file('../data/processed/part_ii/planningarea_income_sg.shp')

## <a id='toc1_5_'></a>[Spatial Queries](#toc0_)

<div align="center"><img src="../images/spatialqueries.jpg"><br>Source: https://doi.org/10.1080/10095020.2022.2163924</div>

### <a id='toc1_5_1_'></a>[Point Query](#toc0_)

In [None]:
# SUTD is located at longitude 103.96239544519815, latitude 1.3406916475105508
query_pt = Point(103.96239544519815, 1.3406916475105508)


# Find the planning area that contains the query point

planningarea_sg[planningarea_sg.contains(query_pt)]

In [None]:
planningarea_tampines = planningarea_sg[planningarea_sg.contains(query_pt)]
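
`contains` returns one Boolean per row, which is then used as a mask on the GeoDataFrame. A toy sketch with two hypothetical unit squares makes the behaviour visible, including the fact that a point lying exactly on a boundary is not counted as contained:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two adjacent squares sharing the edge x = 1
zones = gpd.GeoDataFrame(
    {"name": ["west", "east"], "geometry": [box(0, 0, 1, 1), box(1, 0, 2, 1)]}
)

inside = zones[zones.contains(Point(1.5, 0.5))]       # strictly inside "east"
on_edge = zones[zones.contains(Point(1.0, 0.5))]      # on the shared boundary
touching = zones[zones.intersects(Point(1.0, 0.5))]   # intersects counts boundaries

print(inside["name"].tolist(), len(on_edge), touching["name"].tolist())
```

If boundary points should count as matches, `intersects` (or `covers`) is the more forgiving predicate.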

### <a id='toc1_5_2_'></a>[Polygon Query](#toc0_)

`Task`: Find the cycling paths that intersect the Tampines planning area.

In [None]:
print(type(planningarea_tampines.geometry.values[0]))
planningarea_tampines.geometry.values[0]

In [None]:
cycle_sg[cycle_sg.intersects(planningarea_tampines.geometry.values[0])]

Why does nothing intersect between these two layers?

The CRSs do not match! `cycle_sg` was reprojected to EPSG:3414 earlier, while the planning-area layer has not been reprojected.

In [None]:
cycle_sg.crs == planningarea_tampines.crs

In [None]:
planningarea_tampines_projected = planningarea_tampines.to_crs("EPSG:3414")

In [None]:
cycle_sg.crs == planningarea_tampines_projected.crs

In [None]:
cycle_tampines = cycle_sg[cycle_sg.intersects(planningarea_tampines_projected.geometry.values[0])]

In [None]:
cycle_tampines.plot()
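
Note that `intersects` selects whole features: a path that merely crosses the boundary is returned in full, including the part outside Tampines. If only the portion inside the polygon is wanted, `intersection` trims the geometries. A toy sketch with a hypothetical square and two lines:

```python
import geopandas as gpd
from shapely.geometry import LineString, box

area = box(0, 0, 2, 2)
paths = gpd.GeoSeries([
    LineString([(-1, 1), (3, 1)]),  # crosses the square
    LineString([(5, 5), (6, 6)]),   # entirely outside
])

selected = paths[paths.intersects(area)]  # the whole crossing line, length 4
trimmed = selected.intersection(area)     # only the part inside, length 2

print(selected.length.tolist(), trimmed.length.tolist())
```

Which of the two is appropriate depends on the question: counting paths that reach an area calls for `intersects`, while measuring path length within the area calls for trimming first.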

## <a id='toc1_6_'></a>[Next Section](#toc0_)

Go to [2-02: Raster Analysis](./2-02_raster.ipynb)