# Session 3-1: Data Science for Sustainability (Finally!) 🛰️🌍

![ntl](./assets/ntl.jpg)

Data science for sustainability is a ***broad*** topic. But generally, answering questions about how to create a more sustainable future requires two types of data: human and environmental. These types of data are inherently geospatial because they **map** human and environmental phenomena on planet Earth. [<span class="codeb">Geographic Information Systems</span>](https://en.wikipedia.org/wiki/Geographic_information_system) allow for visualizing, manipulating, and analyzing human and environmental geographic data. But GIS platforms have limited utility because (1) it can be difficult to reproduce workflows and (2) they may not be able to process large quantities of data efficiently. Further, GIS platforms tend to be black boxes that do not allow you to fully understand how your data is being processed. 

Thankfully, open-source data science evangelists have developed a suite of geospatial data science packages – such as [<span class="codeb">GeoPandas</span>](https://geopandas.org) – in Python that build on [NumPy](https://numpy.org), [<span class="codeb">Pandas</span>](https://pandas.pydata.org), and other commonly used Python packages. As such, many of the data structures and functions in packages like <span class="code">GeoPandas</span> are similar to those in [<span class="codeb">Pandas</span>](https://pandas.pydata.org). 

In this session, we will overview how geospatial data can be analyzed in Python. Those of you who have a background in GIS will notice many parallels with ArcGIS and QGIS. The advantage here is that you will have budding capabilities to build your own GIS, but with Python. 
 
<p style="height:1pt"> </p>

<div class="boxhead2">
    Session Topics
</div>

<div class="boxtext2">
<ul class="a">
    <li> 📌 Introduction to <span class="codeb">matplotlib.pyplot</span> </li>
    <ul class="b">
        <li> Anatomy of a plot </li>
    </ul>
    <li> 📌 Basic plotting </li>
    <ul class="b">
        <li> Line plots using <code>plt.plot()</code> </li>
        <li> Scatter plots using <code>plt.scatter()</code> </li>
    </ul>
    <li> 📌 Keyword arguments </li>
    <ul class="b">
        <li> Colors </li>
        <li> Linestyles </li>
        <li> Markers </li>
        <li> Explicit definitions vs. shortcuts </li>
    </ul>    
    <li> 📌 Axes settings </li>
    <ul class="b">
        <li> Limits, labels, and ticks </li>
        <li> Legends + titles </li>
    </ul>
    <li> 📌 Subplots + multiple axes </li>
    <ul class="b">
        <li> <span class="code">Figure</span> vs. <span class="code">Axes</span> methods </li>
    </ul>
    <li> 📌 Working with real data </li>
    
    
</ul>
</div>

<hr style="border-top: 0.2px solid gray; margin-top: 12pt; margin-bottom: 0pt"></hr>

### Instructions
We will work through this notebook together. To run a cell, click on the cell and press "Shift" + "Enter" or click the "Run" button in the toolbar at the top. 

<p style="color:#408000; font-weight: bold"> 🐍 &nbsp; &nbsp; This symbol designates an important note about Python structure, syntax, or another quirk.  </p>

<p style="color:#008C96; font-weight: bold"> ▶️ &nbsp; &nbsp; This symbol designates a cell with code to be run.  </p>

<p style="color:#008C96; font-weight: bold"> ✏️ &nbsp; &nbsp; This symbol designates a partially coded cell with an example.  </p>

<hr style="border-top: 1px solid gray; margin-top: 24px; margin-bottom: 1px"></hr>

## Introduction to GeoPandas

<img src="./assets/geopands.png">



GeoPandas is an open-source Python library that adds geographic information to Pandas Series and DataFrame objects. In other words, it gives a Pandas object a spatial dimension, akin to a .shp file in a GIS platform.


GeoPandas is an open source project to add support for geographic data to pandas objects. It currently implements GeoSeries and GeoDataFrame types which are subclasses of pandas.Series and pandas.DataFrame respectively. GeoPandas objects can act on shapely geometry objects and perform geometric operations.
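
As a minimal sketch of what this looks like in practice, the example below builds a small GeoDataFrame from a plain Pandas DataFrame. The site names and coordinates are made-up illustrative values, not part of the course data.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical points (longitude, latitude); values are for illustration only
df = pd.DataFrame({
    "name": ["site_a", "site_b"],
    "lon": [-118.25, -118.40],
    "lat": [34.05, 34.02],
})

# Attach a shapely geometry to each row and declare the CRS (WGS84 lat/lon)
geometry = [Point(xy) for xy in zip(df["lon"], df["lat"])]
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")

# A GeoDataFrame is still a DataFrame, so familiar Pandas operations work
print(isinstance(gdf, pd.DataFrame))
print(gdf[["name", "geometry"]])

# Geometric attributes and operations act on the geometry column
print(gdf.geometry.x.tolist())
```

Because `GeoDataFrame` subclasses `pandas.DataFrame`, indexing, filtering, and merging behave as they do in Pandas, with the geometry column carrying the spatial information.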



NumPy, an abbreviation for *Numerical Python*, is the core library for scientific computing in Python. In addition to manipulation of array-based data, NumPy provides an efficient way to store and operate on very large datasets. In fact, nearly all Python packages for data storage and computation are built on NumPy arrays. 

This exercise will provide an overview of NumPy, including how arrays are created, NumPy functions to operate on arrays, and array math. While most of the basics of the NumPy package will be covered here, there are many, many more operations, functions, and modules. As always, you should consult the [NumPy Docs](https://docs.scipy.org/doc/numpy/reference/index.html) to explore its additional functionality.

Before jumping into NumPy, we should take a brief detour through importing libraries in Python. While most packages we will use – including NumPy – are developed by third-parties, there are a number of "standard" packages that are built into the Python API. The following table contains a description of a few of the most useful modules worth making note of.

| Module | Description | Syntax |
| :----- | :---------- | :----- |
| <a href="https://docs.python.org/3.8/library/os.html" style="text-decoration: none; font-family: Lucida Console, Courier, monospace; font-weight: bold"> os </a> | Provides access to operating system functionality | <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> import os </span> |
| <a href="https://docs.python.org/3.8/library/math.html" style="text-decoration: none; font-family: Lucida Console, Courier, monospace; font-weight: bold"> math </a> | Provides access to basic mathematical functions | <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> import math </span> |
| <a href="https://docs.python.org/3.8/library/random.html" style="text-decoration: none; font-family: Lucida Console, Courier, monospace; font-weight: bold"> random </a> | Implements pseudo-random number generators for various distributions | <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> import random </span> |
| <a href="https://docs.python.org/3.8/library/datetime.html" style="text-decoration: none; font-family: Lucida Console, Courier, monospace; font-weight: bold"> datetime </a> | Supplies classes for generating and manipulating dates and times | <span style="font-family: Lucida Console, Courier, monospace; font-weight: bold"> import datetime as dt </span> |
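
As a short illustration of two of these modules, the sketch below uses `os` to build a file path and `datetime` to parse the acquisition date embedded in a Landsat scene ID. The scene ID is one used later in this notebook. Reading the fourth underscore-separated field as the acquisition date is an assumption based on the Landsat Collection 1 naming convention.

```python
import os
import datetime as dt

scene_id = "LC08_L1TP_041036_20200820_20200905_01_T1"

# os.path.join builds a path with the right separator for the operating system
band_path = os.path.join("data", "landsat", "Level1", scene_id, scene_id + "_B4.TIF")
print(band_path)

# The 4th field of the scene ID is assumed to be the acquisition date (YYYYMMDD)
acq_date = dt.datetime.strptime(scene_id.split("_")[3], "%Y%m%d").date()
print(acq_date.isoformat())
print(acq_date.timetuple().tm_yday)  # day of year
```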

While 2D Cartesian space (e.g. latitude and longitude) is often sufficient for analysis, most human and environmental data is time-varying, meaning that a third dimension is often present in these datasets.

In [None]:
#### Dir Paths
PATH = 'data/'

In [None]:
#### Dependencies
import os
from glob import glob
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio 
from rasterstats import zonal_stats


### Merge socioeconomic data
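
The merge pattern used in this section can be sketched on toy tables. The neighborhood names and income values below are invented for illustration; only the `NEIGHBORHOOD` key mirrors the real data.

```python
import pandas as pd

# Toy stand-ins for the neighborhood list and one socioeconomic table
names = pd.DataFrame({"NEIGHBORHOOD": ["A", "B", "C"]})
income = pd.DataFrame({"NEIGHBORHOOD": ["A", "C"], "INCOME": [50000, 72000]})

# how='left' keeps every neighborhood and fills missing matches with NaN
merged = names.merge(income, on="NEIGHBORHOOD", how="left")
print(merged)

# Neighborhoods with no match in the socioeconomic table
print(merged.loc[merged["INCOME"].isna(), "NEIGHBORHOOD"].tolist())
```

A left merge is a reasonable choice here because the neighborhood list defines the rows we want to keep, even when a socioeconomic table is missing some of them.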

In [None]:
#### Open shape files
neighborhoods_fn = PATH+'la_county/la_county.shp'
neighborhoods = gpd.read_file(neighborhoods_fn)

In [None]:
#### Get socioeconomic data
se_dir = PATH+'socioeconomic/'
fn_out = 'ESM203_F2020_SocioEcon'

# get names (rename returns a new DataFrame, which avoids a SettingWithCopyWarning)
df_out = neighborhoods[['Name']].rename(columns={'Name':'NEIGHBORHOOD'}) # rename col

# loop through files and merge each table onto the neighborhood names
for fn in os.listdir(se_dir):
    
    col = fn.split('LA-')[1].split('.csv')[0] # Get col name
    df = pd.read_csv(se_dir+fn) # open the fn

    if df.shape[1] == 3:
        df_out = df_out.merge(df.iloc[:,1:3], on = 'NEIGHBORHOOD', how = 'left') # merge 
    
    elif df.shape[1] == 4: # crime
        df_out = df_out.merge(df.iloc[:,1:4], on = 'NEIGHBORHOOD', how = 'left') # merge
        df_out.rename(columns = {'PER CAPITA' : "CRIME PER CAPITA"}, inplace = True)
        df_out.rename(columns = {'TOTAL' : "CRIME TOTAL"}, inplace = True)

# write csv
df_out.to_csv(PATH+fn_out+'.csv', index = False)

# write shape file
gdf_out = neighborhoods[['Name','geometry']].rename(columns={'Name':'NEIGHBORHOOD'}) # rename col
gdf_out = df_out.merge(gdf_out, on = 'NEIGHBORHOOD', how = 'right')
gdf_out = gpd.GeoDataFrame(gdf_out)
gdf_out.to_file(PATH+fn_out+'.shp', index = False)


### Make NDVI and LST from Landsat Scenes
We will do this for two scenes, one from spring and one from summer.
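
NDVI is computed per pixel as (NIR - Red) / (NIR + Red). The sketch below applies that formula to a tiny made-up array and shows why the bands should be cast to floating point first: Landsat 8 Level 1 digital numbers are stored as unsigned 16-bit integers, so subtracting a larger value from a smaller one wraps around instead of going negative. The pixel values are illustrative only.

```python
import numpy as np

# Made-up digital numbers for a 2x2 scene, stored as uint16 like Landsat 8 Level 1 bands
red = np.array([[8000, 9000], [12000, 0]], dtype=np.uint16)
nir = np.array([[16000, 9000], [6000, 0]], dtype=np.uint16)

# Unsigned subtraction wraps around where NIR < Red
wrapped = nir - red
print(wrapped[1, 0])  # a large positive number rather than -6000

# Cast to float first, then silence the 0/0 warning and map NaN to 0
red_f = red.astype("float32")
nir_f = nir.astype("float32")
with np.errstate(divide="ignore", invalid="ignore"):
    ndvi_arr = np.nan_to_num((nir_f - red_f) / (nir_f + red_f))

print(ndvi_arr)
```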

In [None]:
def ndvi(b4_fn, b5_fn, out_fn):
    """Function writes an NDVI image from Landsat 8. Zero-valued pixels at the Landsat edges (0/0) are written as 0.
    Args:
        b4_fn = path to Landsat8 band 4 (red) geotif
        b5_fn = path to Landsat8 band 5 (NIR) geotif
        out_fn = path and name to write out ndvi file
    """
    
    # read bands as float32; the raw uint16 values would wrap around on subtraction
    with rasterio.open(b4_fn) as src:
        meta = src.meta
        band4 = src.read(1).astype('float32') #Red
    with rasterio.open(b5_fn) as src:
        band5 = src.read(1).astype('float32') #NIR
    meta.update({'dtype': 'float32'})
    
    # NDVI = (NIR - Red)/(NIR + Red); NaN from 0/0 is replaced with 0
    with np.errstate(divide='ignore', invalid='ignore'):
        ndvi_arr = np.nan_to_num((band5 - band4)/(band5 + band4))
    ndvi_arr = np.float32(ndvi_arr) # reduce size
    
    # write our raster to disk
    with rasterio.open(out_fn, 'w', **meta) as out:
        out.write(ndvi_arr, 1)

    print('NDVI done')

In [None]:
def bright_temp(b_fn, fn_out, radiance_mult, radiance_add, k1, k2):
    
    """ Function writes a tif for Landsat8 brightness temp from DN. Note, this is not land surface temperature.
    Args:
        b_fn = file name for TIRS band
        fn_out = path and file name to write .tif
        radiance_mult, radiance_add, k1, k2 = all come from the Landsat8 Level 1 XXX_MTL.txt file
    """
    # read & meta
    with rasterio.open(b_fn) as src:
        meta = src.meta
        b = src.read(1)
    meta.update({'dtype': 'float32'})
    
    # Calculate TOA radiance from DN:
    toa  = (b * radiance_mult) + radiance_add
    
    # TOA radiance to brightness temp, T = K2 / ln(K1 / L + 1), then from K to C
    bright = (k2 / np.log((k1 / toa) + 1)) - 273.15
    bright = np.float32(bright)
    
    # Drop brightness values >= 50C
    bright[bright >= 50] = np.nan
    
    # write our raster to disk
    with rasterio.open(fn_out, 'w', **meta) as out:
        out.write(bright, 1)

    print('Brightness temp done')
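
As a worked check of the conversion, the sketch below applies the standard Landsat 8 formulas, radiance L = M * DN + A and brightness temperature T = K2 / ln(K1 / L + 1), to a single pixel. The constants are the band 10 values used in this notebook; the digital number is a made-up value for illustration.

```python
import math

# Band 10 constants as used in this notebook (from the scene's MTL file)
radiance_mult = 3.3420E-04
radiance_add = 0.10000
k1 = 774.8853
k2 = 1321.0789

dn = 25000  # hypothetical digital number for one pixel

# DN -> top-of-atmosphere spectral radiance
radiance = dn * radiance_mult + radiance_add

# Radiance -> brightness temperature in Kelvin, then Celsius
temp_k = k2 / math.log(k1 / radiance + 1)
temp_c = temp_k - 273.15

print(round(radiance, 3), round(temp_c, 1))
```

A single-pixel check like this is a quick way to confirm the result lands in a plausible range for surface temperatures before processing a full scene.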

In [None]:
#### Make NDVI -- Summer 2020-08-20
data_in = PATH+'landsat/Level1/LC08_L1TP_041036_20200820_20200905_01_T1/' 
b4 = data_in+'LC08_L1TP_041036_20200820_20200905_01_T1_B4.TIF'
b5 = data_in+'LC08_L1TP_041036_20200820_20200905_01_T1_B5.TIF'
out = PATH+'interim/NDVI_20200820.tif'
ndvi(b4, b5, out)

In [None]:
#### Make Brightness temp -- Summer 2020-08-20
data_in = PATH+'landsat/Level1/LC08_L1TP_041036_20200820_20200905_01_T1/'
b_fn = data_in+'LC08_L1TP_041036_20200820_20200905_01_T1_B10.TIF'
fn_out = PATH+'interim/BrightTemp_20200820.tif'
radiance_mult = 3.3420E-04
radiance_add = 0.10000
k1 = 774.8853
k2 = 1321.0789
bright_temp(b_fn, fn_out, radiance_mult, radiance_add, k1, k2)

In [None]:
#### Make NDVI -- Spring 2020-04-14
data_in = PATH+'landsat/Level1/LC08_L1TP_041036_20200414_20200423_01_T1/' 
b4 = data_in+'LC08_L1TP_041036_20200414_20200423_01_T1_B4.TIF'
b5 = data_in+'LC08_L1TP_041036_20200414_20200423_01_T1_B5.TIF'
out = PATH+'interim/NDVI_20200414.tif'
ndvi(b4, b5, out)

In [None]:
#### Make Brightness temp -- Spring 2020-04-14
data_in = PATH+'landsat/Level1/LC08_L1TP_041036_20200414_20200423_01_T1/'
b_fn = data_in+'LC08_L1TP_041036_20200414_20200423_01_T1_B10.TIF'
fn_out = PATH+'interim/BrightTemp_20200414.tif'
radiance_mult = 3.3420E-04
radiance_add = 0.10000
k1 = 774.8853
k2 = 1321.0789
bright_temp(b_fn, fn_out, radiance_mult, radiance_add, k1, k2)

### Run zonal stats
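
Conceptually, a zonal mean averages all raster pixels that fall inside each polygon. The NumPy sketch below illustrates the idea with a made-up raster and a matching array of zone labels standing in for rasterized polygons. It is not how `rasterstats` is implemented, only an illustration of what the statistic means.

```python
import numpy as np

# Made-up 3x4 raster of values (e.g. NDVI) and a same-shaped array of zone IDs
values = np.array([[0.1, 0.2, 0.6, 0.8],
                   [0.1, 0.2, 0.6, 0.8],
                   [0.3, 0.4, np.nan, 0.7]])
zones = np.array([[1, 1, 2, 2],
                  [1, 1, 2, 2],
                  [1, 1, 2, 2]])

# Mean of the pixels in each zone, ignoring NaN (no-data) pixels
zonal_means = {int(z): float(np.nanmean(values[zones == z])) for z in np.unique(zones)}
print(zonal_means)
```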

In [None]:
def zonal(rst_in, polys_in, do_stats): 
    """Function will run zonal stats on a raster and a set of polygons. all_touched is set to True. 
    
    Args:
        rst_in = file name/path of raster to run zonal stats on
        polys_in = GeoDataFrame of polygons to summarize the raster over
        do_stats = stats to use, see rasterstats package for documentation (e.g. 'mean')
    """
    
    # switch crs
    polys_in = polys_in.to_crs(epsg=32611) # CRS of Landsat tifs
    
    # Run Zonal Stats
    zs_feats = zonal_stats(polys_in, rst_in, stats=do_stats, geojson_out=True, all_touched=True)
        
    # Turn the GeoJSON features into a geo data frame
    zgdf = gpd.GeoDataFrame.from_features(zs_feats, crs=polys_in.crs)
    
    return zgdf

In [None]:
polys_in = neighborhoods[['Name', 'geometry']]

In [None]:
#### Run Zonal stats - Throws error for Catalina Island which isn't in the scene 
rst_in = PATH+'interim/NDVI_20200414.tif'
ndvi_spring = zonal(rst_in, polys_in, 'mean')
ndvi_spring.rename(columns = {'mean' : 'NDVI_SPRING'}, inplace = True)

rst_in = PATH+'interim/NDVI_20200820.tif'
ndvi_summer = zonal(rst_in, polys_in, 'mean')
ndvi_summer.rename(columns = {'mean' : 'NDVI_SUMMER'}, inplace = True)

rst_in = PATH+'interim/BrightTemp_20200414.tif'
temp_spring = zonal(rst_in, polys_in, 'mean')
temp_spring.rename(columns = {'mean' : 'TEMP_SPRING'}, inplace = True)

rst_in = PATH+'interim/BrightTemp_20200820.tif'
temp_summer = zonal(rst_in, polys_in, 'mean')
temp_summer.rename(columns = {'mean' : 'TEMP_SUMMER'}, inplace = True)


### Merge Final Data

In [None]:
#### Merge it all together
df_list = [ndvi_spring, ndvi_summer, temp_spring, temp_summer]

for df in df_list:
    df.rename(columns = {'Name' : 'NEIGHBORHOOD'}, inplace = True)
    df_out = df_out.merge(df.iloc[:,1:3], on = 'NEIGHBORHOOD', how = 'left')

In [None]:
df_out

In [None]:
# Write CSV
fn_out = 'ESM203_F2020_Ass1'
df_out.to_csv(PATH+fn_out+'.csv', index = False)

In [None]:
# Write shape file
gdf_out = neighborhoods[['Name','geometry']].rename(columns={'Name':'NEIGHBORHOOD'}) # rename col
gdf_out = df_out.merge(gdf_out, on = 'NEIGHBORHOOD', how = 'right')
gdf_out = gpd.GeoDataFrame(gdf_out)
gdf_out.to_file(PATH+fn_out+'.shp', index = False)

#### Test visualizations

In [None]:
df_out.columns

In [None]:
import matplotlib.pyplot as plt
plt.scatter(df_out['NDVI_SUMMER'], np.log(df_out['CRIME PER CAPITA']))
plt.xlim([0,.4])
