# 1) Loading Data and Creating Maps in Madina

## Madina, what, why and how?
The main motivation for building Madina is to provide a free, open-source environment for researchers and practitioners in urban planning. Python was chosen as the programming language because of its wide adoption and the available body of open-source packages that can be used to work with and analyze urban data.

One immediate benefit of having an extensive body of open-source libraries written in the same language is the ability to write a complete analysis workflow, spanning all the steps of a typical research project, in a single script. This offers significant advantages:
1) **Organization**: A complete research workflow written as a script needs no intermediate files for each analysis step, because the output of one step no longer has to be passed as input to the next step, likely in a different software package. A script-based research project, together with a folder of raw data, eliminates the need to store, track, pass, and exchange intermediate results, reducing the chance of errors.
2) **Non-linear progression**: When using fragmented software to carry out an analysis, a step that depends on CAD software must be completed before starting a following step that depends on GIS, which in turn must be completed before starting a statistical analysis in Stata. This sequential process makes it hard to collaborate, manage files, and detect and diagnose mistakes. Most importantly, it can force work to be repeated if a mistake is made in an earlier stage of the process. Using a script that depends on a raw data folder solves this issue, as many steps can be carried out in parallel by multiple people using synthetic or sample input. When all the steps are completed, it becomes simple to integrate everything into a single script. Re-running the script after fixing a mistake is significantly less time-consuming than repeating multiple tasks, multiple times.
3) **Transparency and Reproducibility**: A key advantage of maintaining a single script for a research project is that every step is explicitly documented. All steps are laid out, and all tools used, together with their settings, parameters and inputs, are recorded. This makes it possible for collaborators and the research community at large to inspect the process and help identify any issues. Once a script has been used in a project with a set of raw input data, it can easily be replicated for other urban areas or other time periods as soon as the data is available. This makes research more efficient, and it makes outcomes and results comparable because they come out of an identical process.

Madina aims to provide a collection of tools and functionalities by implementing commonly used urban planning methodologies. Madina also aims to reduce the effort needed to use multiple open-source libraries. Currently, Madina makes it seamless to handle spatial data (through Geopandas), create origin and destination networks (through NetworkX), run urban network analysis (through a custom implementation of UNA), and visualize results (through Deck.gl), all with very few lines of code. All the formatting and data passing between these packages happens through the Zonal object, Madina's equivalent of a workspace or a layer management system.

## Creating a Zonal object and Loading Data

In [1]:
import madina as md

cambridge = md.Zonal()

`cambridge` is now a Zonal object, Madina's representation of a workspace. This object holds the data layers, networks and other data structures needed for urban research workflows. The function `describe()` gives details about the state of the Zonal object.

In [2]:
cambridge.describe()

No zonal_layers yet, load a layer using 'load_layer(layer_name, file_path)'
Geographic center: (None, None)
No network graph yet. First, insert a layer that contains network segments (streets, sidewalks, ..) and call create_street_network(layer_name,  weight_attribute=None)
	Then,  insert origins and destinations using 'insert_nodes(label, layer_name, weight_attribute)'
	Finally, when done, create a network by calling 'create_street_network()'


We notice that there are no layers yet. We load our first layer by calling the function `load_layer`. It takes two arguments:
* `layer_name`: a string that represents a name for the layer. It is used to identify the layer when it is referenced in other functions.
* `file_path`: a string, or anything the function 

Because layers are represented internally as Geopandas GeoDataFrames, any file format supported by Geopandas can be used here. `.shp` and `.geojson` are among the most widely used spatial data formats and are recommended as input files.
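For readers who do not have a data file at hand, the following is a minimal sketch of what a small `.geojson` input could look like, written with only the Python standard library. The file name `sample_sidewalks.geojson`, the coordinates, and the `name` property are made up for illustration and are not part of the Cambridge dataset used in this chapter.

```python
import json
import tempfile
from pathlib import Path

# Two short line segments near Cambridge, MA, as (longitude, latitude) pairs.
# The coordinates and the "name" property are invented for illustration.
features = [
    {
        "type": "Feature",
        "properties": {"name": "segment_a"},
        "geometry": {
            "type": "LineString",
            "coordinates": [[-71.0930, 42.3600], [-71.0920, 42.3605]],
        },
    },
    {
        "type": "Feature",
        "properties": {"name": "segment_b"},
        "geometry": {
            "type": "LineString",
            "coordinates": [[-71.0920, 42.3605], [-71.0910, 42.3610]],
        },
    },
]
collection = {"type": "FeatureCollection", "features": features}

# Write to a temporary folder rather than the project's raw data folder.
sample_path = Path(tempfile.mkdtemp()) / "sample_sidewalks.geojson"
sample_path.write_text(json.dumps(collection, indent=2))

# Read the file back to confirm it is valid JSON with the expected structure.
loaded = json.loads(sample_path.read_text())
print(loaded["type"], len(loaded["features"]))

# With Madina installed, such a file could then be passed to load_layer,
# for example: cambridge.load_layer('sample', str(sample_path))
```

Each feature carries one geometry and a dictionary of properties; when such a file is read into a GeoDataFrame, the properties become columns and each feature becomes a row.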

In [3]:
cambridge.load_layer(
    name='sidewalks', 
    source='Cities/Cambridge/Data/sidewalks.geojson'
)

In [4]:
cambridge.describe()

Layer name           | Visible | projection | rows  | File path           
sidewalks            |       1 | EPSG:3857  |   170 | Cities/Cambridge/Data/sidewalks.geojson
Geographic center: (-0.014266175861540071, 0.0016269462167978611)
No network graph yet. First, insert a layer that contains network segments (streets, sidewalks, ..) and call create_street_network(layer_name,  weight_attribute=None)
	Then,  insert origins and destinations using 'insert_nodes(label, layer_name, weight_attribute)'
	Finally, when done, create a network by calling 'create_street_network()'


Notice that we now have one layer called `sidewalks` with 170 rows. An important thing that happens after loading the first layer is that the default map center for visualization is calculated; you can see it in the output of `cambridge.describe()`. The visualization geographic center is a pair of latitude and longitude coordinates and can easily be overridden, for instance by setting `cambridge.geo_center = (24.77, 46.73)`. To visualize the workspace, call the function `create_map()`.

In [5]:
cambridge.create_map()

This map is produced by [Deck.GL](https://deck.gl/) through [PyDeck](https://pydeck.gl/), a powerful visualization package, by passing the layers contained in the Zonal object together with some default settings. Layer data inside Madina is maintained in a [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html), a table representation from the Python package [Geopandas](https://geopandas.org/). You can access a layer's GeoDataFrame:

In [6]:
cambridge['sidewalks'].gdf

| id | `__Length` | `__GUID` | geometry |
|---|---|---|---|
| 0 | 53.328770 | 1faf3b03-2e30-44b2-8b28-f84da30193c4 | LINESTRING (-1719.114 147.249, -1718.693 124.0... |
| 1 | 33.137771 | 1956c1b1-6c7b-46c6-be18-630210c0c086 | LINESTRING (-1705.465 177.057, -1714.466 163.1... |
| 2 | 82.471466 | 7a8f2a5b-e209-4b06-9c03-19df15c2e86c | LINESTRING (-1635.552 240.490, -1555.262 259.331) |
| 3 | 20.707448 | 65e6f380-1774-4439-9478-d23c97aa8346 | LINESTRING (-1648.164 226.821, -1647.328 234.0... |
| 4 | 60.851523 | a77163e9-5762-457c-8dda-99b4cfb29da4 | LINESTRING (-1662.239 245.651, -1696.978 295.612) |
| ... | ... | ... | ... |
| 165 | 15.383524 | db2d5c42-357e-4ef4-ac0d-c01414466159 | LINESTRING (-1651.361 256.529, -1662.239 245.651) |
| 166 | 22.520863 | b561650f-19dc-4496-a312-b9ecfebf9fbd | LINESTRING (-1651.361 256.529, -1635.552 240.490) |
| 167 | 21.412938 | 53692ac7-ccd5-4135-a75b-58540acdb01a | LINESTRING (-1680.556 208.784, -1691.978 226.897) |
| 168 | 23.915390 | 0097e461-14b4-4198-9d46-91766699850b | LINESTRING (-1741.957 154.332, -1719.114 147.249) |


We'll learn more about manipulating GeoDataFrames later in this chapter. 

## Coordinate Reference Systems (CRS) and Projections
When dealing with urban data, you must be familiar with [Coordinate Reference Systems (CRS)](https://en.wikipedia.org/wiki/Coordinate_reference_system). Two important CRS types to know are:
* [Geographic coordinate systems](https://en.wikipedia.org/wiki/Geographic_coordinate_system): The most common type of coordinate system. They locate a point with a pair of latitude and longitude coordinates, measured in degrees from the equator and the prime meridian. The most widely recognized geographic coordinate system is the World Geodetic System (WGS 84), `EPSG:4326`. This is the coordinate system used in GPS and in most navigation and mapping software. Because their units are degrees, geographic coordinate systems should not be used directly in Cartesian distance calculations. Deck.gl, the visualization library used in Madina, expects data to be in this CRS, and the needed conversions are handled internally.

* [Projected coordinate systems](https://en.wikipedia.org/wiki/Projected_coordinate_system): Projected coordinates are the result of using a [map projection](https://en.wikipedia.org/wiki/Map_projection) to convert the curved surface of the earth into a flat representation. Any projection method entails a loss of accuracy whose magnitude depends on the map projection and the location. It is very important to use a projected coordinate system that works well in the area of interest. Each projected coordinate system is assigned a distance unit. For instance, the recommended projected coordinate system [for use in Massachusetts](https://www.mass.gov/info-details/learn-about-massgis-data) is the "Massachusetts State Plane Coordinate System, Mainland Zone meters", `EPSG:26986`. Notice that this CRS is in meters, and all data reported in [MassGIS](https://www.mass.gov/info-details/massgis-data-layers) is in this CRS. Familiarize yourself with the recommended CRS for your area of interest, and try to avoid less accurate global CRSs such as "WGS 84 / Pseudo-Mercator", `EPSG:3857`, frequently used by global map providers such as Google Maps, OpenStreetMap, Bing, and ESRI. Geopandas, the package that handles spatial data representation, assumes the data is in a projected CRS and reports measurements in the units of the given CRS.
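To make the relationship between the two CRS types concrete, here is a small sketch, using only the standard library, of the spherical formulas behind `EPSG:3857`. It converts Web Mercator meters back to longitude and latitude and estimates how much the projection stretches lengths at a given latitude. The function names are hypothetical helpers written for this illustration and are not part of Madina or Geopandas. The latitude of 42.36 degrees for Cambridge, MA is an approximate value used as an assumption.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def web_mercator_to_lonlat(x, y):
    """Convert EPSG:3857 meters to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


def mercator_scale_factor(lat_deg):
    """Approximate factor by which Mercator stretches lengths at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


# First vertex of sidewalk segment 0 from the GeoDataFrame shown earlier.
lon, lat = web_mercator_to_lonlat(-1719.114, 147.249)
print(f"lon={lon:.6f}, lat={lat:.6f}")

# Approximate latitude of Cambridge, MA (an assumption for illustration).
k = mercator_scale_factor(42.36)
print(f"A true 100 m segment measures about {100 * k:.1f} m in EPSG:3857")
```

At this latitude the scale factor is roughly 1.35, so lengths measured directly in `EPSG:3857` come out about 35% longer than on the ground, which is why a local projected CRS is preferable for distance-based analysis.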

Because datasets vary widely, Madina does not re-project any layer to ensure consistency; it issues a warning instead. The user is responsible for ensuring that all data layers are in an appropriate CRS before attempting any analysis. Now, load the `buildings` and `subway` layers. Notice that it is not strictly necessary to name the arguments `layer_name` and `file_path` explicitly if you list the inputs for `load_layer()` in the correct order. Always consult the documentation to confirm the order of parameters, or specify parameter names explicitly.

In [7]:
cambridge.load_layer('buildings', 'Cities/Cambridge/Data/building_entrances.geojson')
cambridge.load_layer('subway', 'Cities/Cambridge/Data/subway.geojson')
cambridge.describe()

Layer name           | Visible | projection | rows  | File path           
sidewalks            |       1 | EPSG:3857  |   170 | Cities/Cambridge/Data/sidewalks.geojson
buildings            |       1 | EPSG:3857  |   118 | Cities/Cambridge/Data/building_entrances.geojson
subway               |       1 | EPSG:3857  |     2 | Cities/Cambridge/Data/subway.geojson
Geographic center: (-0.014266175861540071, 0.0016269462167978611)
No network graph yet. First, insert a layer that contains network segments (streets, sidewalks, ..) and call create_street_network(layer_name,  weight_attribute=None)
	Then,  insert origins and destinations using 'insert_nodes(label, layer_name, weight_attribute)'
	Finally, when done, create a network by calling 'create_street_network()'


We notice that we have a `buildings` layer with 118 building entrances, and a `subway` layer with 2 subway stations. Let's look at the map:


In [8]:
cambridge.create_map()

Once each layer is loaded, it is assigned a random color, which can result in less-than-ideal visuals. If you look at the documentation, you'll notice that the function `create_map()` can take three arguments:
* `layer_list`: This parameter takes a list of dictionaries of the form [{...}, {...}, ...]. Each dictionary in this list represents a layer. Each key:value pair in the dictionary represents a visualization parameter name and its setting. These parameters are used internally to prepare each layer's GeoDataFrame, which is then passed on to create a Deck.GL layer with the corresponding settings. The following strings can be used as dictionary keys, with these value options:
    * `layer`: the value can be the name of one of the layers contained in the `Zonal` object. You can get a list of layers by calling `cambridge.describe()` or `cambridge.layers.layers`.
    * `gdf`: the value can be a GeoDataFrame object. This allows visualizing data that is not inside your `Zonal` object, or data that has been processed or filtered, for instance. We'll learn more about handling GeoDataFrames in the next section.
    * `color`: the value can be a list of three numbers between 0 and 255 representing an RGB color. For instance, `[0, 0, 255]` is blue.
    * `color_by_attribute`: the value can be one of the layer/gdf attributes (i.e. column names). You can get a list of a layer's column names by calling `cambridge['sidewalks'].gdf.columns`, or by hovering over any layer's visualized geometries when calling `cambridge.create_map()`.
    * `color_method`: there are four coloring methods:
        * `single_color`: This is the default setting, and you don't need to specify `'color_method':'single_color'` if `color` is set. If `color` is not set, a new random color is assigned.
        * `categorical`: This coloring method is suitable for categorical data with a few unique values. If `color` is not assigned, each unique value is assigned a random color. You can assign specific colors to individual unique values by setting `color` to a dictionary such as `'color': {'value_1': [255, 0, 0], 'value_2': [0, 255, 0], 'value_3': [0, 0, 255]}` to assign red to all geometries with 'value_1', green to all geometries with 'value_2' and blue to all geometries with 'value_3'. You can get a list of unique values inside a layer's column by calling `cambridge['layer_name'].gdf['column_name'].unique()`.
        * `gradient`: This coloring method is suitable for numerical data, where the highest value is set to green and the lowest value is set to red. The scale is gradual and can easily be skewed by extreme values.
        * `quantile`: This coloring method is suitable for numerical data. Instead of using the numerical value, each entry is assigned its percentile; the highest-ranking value is set to green, and the lowest is set to red. The scale is not sensitive to extreme values, as values are converted into ranked percentiles between 0 and 1. The median value would be yellow.
    * `radius`: if the layer/gdf contains points, setting this parameter to a column name resizes points according to the values of that column. The column must contain numerical values only.
    * `width`: if the layer/gdf contains lines/polylines, setting this parameter to a column name resizes line widths according to the values of that column. The column must contain numerical values only.
    * `opacity`: a number between 0 and 1 indicating the layer/gdf's opacity level, with 0 meaning fully transparent and 1 meaning fully opaque.
    * `text`: setting this to a column name overlays text annotations on each geometry.

* `save_as`: Maps are not saved by default. If this parameter is set to a file name, for example `save_as='cambridge_map.html'`, an HTML version of the map is saved.
* `basemap`: False by default. If set to True, it enables Deck.gl's default base map, currently [Carto](https://carto.com/basemaps).

This is an example of how to use these visualization settings:

In [9]:
cambridge.create_map(
    layer_list=[
        {
            'layer': 'sidewalks',
            'color_by_attribute': '__Length',
            'color_method': 'quantile'
        }, 
        {
            'layer': 'buildings',
            'color_by_attribute': 'people',
            'color_method':'gradient',
            'radius': 'people',
            'radius_min': 1, 
            'radius_max': 6,
        }, 
        {
            'layer': 'subway',
            'color': [0, 200, 255],
            'text': 'id'
        }
    ], 
    save_as='cambridge_map.html', 
    basemap=True
)
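To see why the `gradient` and `quantile` methods described above can produce very different maps, the following standard-library sketch colors the same values both ways. It is not Madina's implementation, which this chapter does not show. The function names and the linear red-yellow-green ramp are assumptions chosen to match the description (lowest value red, median yellow, highest value green), and the `lengths` values are invented.

```python
def ramp(t):
    """Map t in [0, 1] to an RGB list: red at 0, yellow at 0.5, green at 1."""
    t = min(max(t, 0.0), 1.0)
    if t <= 0.5:
        return [255, round(255 * 2 * t), 0]
    return [round(255 * 2 * (1 - t)), 255, 0]


def gradient_colors(values):
    """Color by position between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [ramp((v - lo) / span) for v in values]


def quantile_colors(values):
    """Color by rank, so only the ordering of the values matters."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for position, index in enumerate(order):
        rank[index] = position
    # Ties are ranked in input order here; a fuller version might average them.
    return [ramp(r / max(n - 1, 1)) for r in rank]


# Invented segment lengths with one extreme value.
lengths = [10, 12, 15, 20, 1000]
for value, g, q in zip(lengths, gradient_colors(lengths), quantile_colors(lengths)):
    print(value, "gradient:", g, "quantile:", q)
```

With the extreme value present, the gradient method leaves the four smaller values almost indistinguishably red, while the rank-based method spreads them across the full ramp.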

## Manipulating GeoDataFrames
Geopandas is a powerful package and provides functionalities that rival those of a typical GIS system, sometimes with more flexibility, as many functionalities can incorporate more complex and customized operations. Geopandas offers geometric manipulation, set operations and aggregation functionalities that come in handy in many urban planning applications.

Most operations in Geopandas create a new dataframe as a result. If you want to manipulate a layer's dataframe, be sure to assign the result back to the layer. 

As an example, we create a new attribute called "building_size", and set it to small if fewer than 25 people live in that building, and large if 25 people or more live in that building. This is a simple operation; the aim is to show the sequence for manipulating GeoDataFrames in Madina: retrieve, process, assign back.

In [10]:
# retrieve geodataframe
buildings_gdf = cambridge['buildings'].gdf

# do some processing
buildings_gdf['building_size'] = buildings_gdf['people'].apply(lambda x: 'small' if x < 25 else 'large')

# assign back to layer
cambridge['buildings'].gdf = buildings_gdf

This is a good opportunity to illustrate assigning a specific color to each categorical value. When "building_size" is 'small', buildings are colored orange (`[200, 100, 0]`); 'large' is assigned blue (`[0, 100, 200]`).

In [11]:
cambridge.create_map(
    [
        {'layer': 'sidewalks', 'color': [100, 100, 100]}, 
        {
            'layer': 'buildings',
            'color_by_attribute': 'building_size',
            'color_method':'categorical',
            'color': {'small': [200, 100, 0], 'large': [0, 100, 200]},
            'text': 'people'
        }, 
    ]
)
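The categorical coloring used above pairs each unique value with a color and, as described earlier, falls back to a random color when none is given. The following standard-library sketch illustrates that idea under stated assumptions: the function `categorical_colors` and its `seed` parameter are hypothetical, the 'medium' category is invented to show the fallback, and this is not Madina's own code.

```python
import random


def categorical_colors(values, palette=None, seed=0):
    """Return one RGB color per value, plus the final value-to-color legend."""
    legend = dict(palette or {})
    rng = random.Random(seed)  # seeded so the random colors are repeatable
    for value in values:
        if value not in legend:
            # No color was specified for this value, so pick a random one.
            legend[value] = [rng.randint(0, 255) for _ in range(3)]
    return [legend[value] for value in values], legend


# 'medium' is a hypothetical category with no assigned color.
sizes = ['small', 'large', 'small', 'medium']
colors, legend = categorical_colors(
    sizes, palette={'small': [200, 100, 0], 'large': [0, 100, 200]}
)
for size, color in zip(sizes, colors):
    print(size, color)
```

Returning the legend alongside the colors makes it easy to check which color each category received, including any that were assigned randomly.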