### Mission
In this notebook our aim is to create a coral bleaching predictor based on time-series data for a specific geographical region.
1) The first part concentrates on using geospatial libraries and xarray, a library for multidimensional data. Here you will learn how to manipulate and visualize Sea Surface Temperature data using xarray.

2) The second part will load a pre-trained model to make predictions in areas where bleaching data is not available. The predictions are made using the timeseries data in the first part. 

### Libraries
Libraries that you should have a glance at:
* odp - [The Ocean Data Platform SDK](https://odp-sdk-python.readthedocs.io/en/master/)
* geopandas - [Geopandas](https://geopandas.org/en/stable/getting_started/introduction.html)
* xarray - [Xarray](https://docs.xarray.dev/en/stable/)

We have to pip install tensorflow for the second part of this notebook

In [None]:
%pip install tensorflow

In [None]:
import geopandas as gpd
import pandas as pd
import json
import xarray as xr

import odp.geospatial as odp
import matplotlib.dates as md
import dateutil
import numpy as np

import sqlite3

import matplotlib.pyplot as plt

import cartopy.crs as ccrs
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
import cartopy.feature as cfeature
import cmocean

import matplotlib.cm as cm
from math import pi, sqrt
import pickle
import os
import warnings
warnings.filterwarnings("ignore")
import azure.storage.blob 
import zarr
import altair as alt
import hvplot.xarray

In [None]:
# instantiate the Ocean data platform database
db=odp.Database()
# instantiate plotting tools
db_plt = odp.PlotTools()

### Retrieving time series data for Sea Surface Temp
The code below connects to Azure storage to retrieve the time series for Sea Surface Temperature. This is data import boilerplate - you do not have to understand what is going on here to finish the notebook.

In [None]:
blob_service_client=azure.storage.blob.BlobServiceClient.from_connection_string(os.environ['ODE_CONNECTION_STR'])

In [None]:
container_client = blob_service_client.get_container_client('crw')

In [None]:
container='crw'
folder='zarr/'

In [None]:
file_list = list(set([b.name for b in container_client.walk_blobs(folder, delimiter='/')  ]))
file_list.sort()

In [None]:
%%time

store_list=[]
for year in range(1985,2023):
    result = list(filter(lambda x: "_"+str(year)+"_" in x, file_list))
    for file in result:
        store = zarr.ABSStore(prefix=file, client=container_client)
        store_list.append(store)
temp_data=xr.open_mfdataset(store_list, parallel=True, engine='zarr')

### The output is an [xarray](https://docs.xarray.dev/en/stable/) dataset, which is a multi-dimensional, in-memory array database

In [None]:
temp_data
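
To get a feel for what an xarray dataset looks like before working with the full archive, here is a minimal sketch that builds a tiny synthetic one. The variable name `analysed_sst` and the `time`/`lat`/`lon` dimensions follow the notebook; the grid, dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the SST dataset: 3 days on a 2 x 4 lat/lon grid.
# The variable name analysed_sst matches the notebook; the values are made up.
time = pd.date_range("2005-01-01", periods=3, freq="D")
lat = np.array([20.0, 21.0])
lon = np.array([-80.0, -79.0, -78.0, -77.0])
sst = 25.0 + np.arange(3 * 2 * 4, dtype=float).reshape(3, 2, 4) / 10.0

toy = xr.Dataset(
    data_vars={"analysed_sst": (("time", "lat", "lon"), sst)},
    coords={"time": time, "lat": lat, "lon": lon},
)

print(toy.sizes)            # dimension names and lengths
print(list(toy.data_vars))  # variables stored in the dataset

# One grid cell followed over time, selected by coordinate labels
cell = toy.analysed_sst.sel(lat=21.0, lon=-79.0).values
print(cell)
```

Every variable is addressed by named dimensions and coordinate labels rather than by integer positions, which is what makes the slicing later in the notebook readable.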

### Import the polygon of Florida and Northern Caribbean using geopandas
As most of our data is in the Florida and Northern Caribbean region, we have decided to concentrate on this area. We have created a premade geojson file containing the coordinates of the drawn polygon that represents the area.

You can feel free to create your own bounding box and explore temperatures in those areas if you like.

In [None]:
poly = gpd.read_file('boundary.geojson')
poly.head()

In [None]:
poly['geometry'][0]

### Figure out the bounding box for our area of interest
A bounding box is the smallest rectangle that contains all of the given points in the selected region.<br>Since our temperature data has information on what region the temperature is measured in, we want to make sure that we only select data from that specific area. <br>In order to do this we need to find the boundary coordinates for the area.


In [None]:
coords = list(poly["geometry"][0].envelope.exterior.coords)
coords
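
The slicing step later on picks individual corners out of `coords` by position, so it helps to know what a bounding box looks like as a list of corners. Below is a minimal sketch in plain Python with hypothetical vertices (not the real boundary file). The corner order shown (lower-left first, then counter-clockwise) is an assumption about what the envelope returns, so it is worth comparing against the `coords` output above.

```python
# Hypothetical polygon vertices as (lon, lat) pairs; not the notebook's real boundary.
vertices = [(-82.5, 24.0), (-79.0, 27.5), (-74.0, 23.0), (-77.5, 19.5)]

lons = [lon for lon, _ in vertices]
lats = [lat for _, lat in vertices]
min_lon, max_lon = min(lons), max(lons)
min_lat, max_lat = min(lats), max(lats)

# Corner order: lower-left, lower-right, upper-right, upper-left, back to start.
bbox_corners = [
    (min_lon, min_lat),
    (max_lon, min_lat),
    (max_lon, max_lat),
    (min_lon, max_lat),
    (min_lon, min_lat),
]
print(bbox_corners)
```

If you only need the four numbers, `poly.total_bounds` in geopandas returns them directly as `(minx, miny, maxx, maxy)`.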

### Take a slice of the temperature data only for the area we care about (the bounding box we created)
xarray has a built-in slicing tool that lets us select just the part of the data that falls inside the bounding box

In [None]:
ds_slice =temp_data.sel(lon=slice(coords[0][0],coords[1][0]), lat=slice(coords[0][1], coords[2][1]))
ds_slice
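
A small sketch of how label-based slicing behaves, using a synthetic 1-degree grid (the grid and its values are assumptions for illustration only). Two things are worth knowing: both end points of the slice are included, and the slice bounds have to follow the order in which the coordinate is stored.

```python
import numpy as np
import xarray as xr

# Synthetic 1-degree grid; values are placeholders, only the coordinates matter here.
lat = np.arange(15.0, 31.0)    # 15..30, ascending
lon = np.arange(-90.0, -69.0)  # -90..-70, ascending
grid = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="analysed_sst",
)

# Label-based slices include both end points.
box = grid.sel(lon=slice(-80, -78), lat=slice(20, 22))
print(box.shape)

# If a coordinate is stored in descending order, the slice bounds must be reversed too.
flipped = grid.sortby("lat", ascending=False)
box_flipped = flipped.sel(lon=slice(-80, -78), lat=slice(22, 20))
print(box_flipped.shape)
```

If a slice unexpectedly comes back empty, checking whether `lat` is ascending or descending in `temp_data` is a good first step.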

## Let's take a look at 2005, a year with a lot of coral bleaching in the Caribbean
Again, xarray has built-in functionality that allows you to take a specific time slice

In [None]:
ds_2005 = ds_slice.isel(time=(ds_slice.time.dt.year == 2005))
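
There is more than one way to pick a year out of the time dimension. The sketch below uses a synthetic daily series (made up for illustration) to compare the boolean mask used above with two label-based alternatives.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series spanning three years; the values are just a counter.
time = pd.date_range("2004-01-01", "2006-12-31", freq="D")
series = xr.DataArray(np.arange(time.size), coords={"time": time}, dims="time")

by_mask = series.isel(time=(series.time.dt.year == 2005))      # boolean mask on the year
by_label = series.sel(time="2005")                              # partial date string
by_range = series.sel(time=slice("2005-01-01", "2005-12-31"))   # explicit date range

print(by_mask.size, by_label.size, by_range.size)
```

All three return the same days here. The date-range form is handy when the period of interest does not line up with a calendar year.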

### With built in xarray functions we can easily visualize the data

In [None]:
monthly_means = ds_2005.groupby("time.month").mean()
fg = monthly_means.analysed_sst.plot(
    col="month",
    col_wrap=4,
    cmap=cmocean.cm.thermal,
)

### And even see it play over time!

In [None]:
ds_2005.hvplot(
    groupby="time",
    clim=(15, 35),
    widget_type="scrubber",
    widget_location="bottom",
)

### Challenge

What observations can we draw from the data visualizations over the course of a year?
<br>Can we compare the same data from different years to see how they differ?
<br>In 2010, the Florida Keys had a severe cold front that resulted in major coral death. Can this be seen in this data?

### We can also turn the xarray into a pandas dataframe that more people are familiar with

Start by taking a small subset of the xarray to speed up our processing. Then we will split the datetime object from the xarray into separate time columns so that we can easily filter the data

In [None]:
ds_slice =temp_data.sel(lon=slice(-80,-78), lat=slice(20, 22), time=slice('1990-01-01', '2022-01-01'))
ds_slice

This next step takes a while (around 5 mins), so it might be a good time to grab a coffee, or check out this really [cool Google Earth site about coral bleaching!](https://earth.google.com/web/@24.4430141,123.8161774,-0.51676057a,500d,35y,10.51093386h,0t,0r/data=CkoSSBIgY2EwYzk0ZGNhN2I4MTFlN2I1ZDBiNzRhMWFlNGU2MDMiJGVmZWVkX29jZWFuX2FnZW5jeV9jb3JhbF9ibGVhY2hpbmdfMQ)

In [None]:
%%time
df = ds_slice.to_dataframe().reset_index() ## This can take some time depending on size of slice
df['time'] = pd.to_datetime(df['time'])  # time is assumed to already be datetime64, so no format string is needed
df['mnth_yr'] = df.time.dt.to_period('M').astype(str)
df['year'] = df.time.dt.year.astype(str)
df['month'] = df.time.dt.month.astype(int)

In [None]:
print(len(df))
df.head()  # must be the last expression in the cell to be displayed

In [None]:
temp_by_mnth_yr_df = df.groupby(['mnth_yr', 'year','month']).agg({'analysed_sst': ['mean', 'min', 'max']}).reset_index()
temp_by_mnth_yr_df.columns=['mnth_yr','year','month','mean','min','max']


In [None]:
temp_by_mnth_yr_df
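
The two-step pattern above (aggregate, then rename the columns) can also be written in one step with pandas named aggregation. This is a minimal sketch on made-up readings, not a replacement for the cell above; only the column names `analysed_sst` and `mnth_yr` are taken from the notebook.

```python
import pandas as pd

# Made-up readings: two months of 2005 with a few values each.
toy = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2005-08-01", "2005-08-02", "2005-08-03", "2005-09-01", "2005-09-02"]
        ),
        "analysed_sst": [30.0, 31.0, 29.0, 30.5, 31.5],
    }
)
toy["mnth_yr"] = toy["time"].dt.to_period("M").astype(str)

# Named aggregation: output column name = (input column, aggregation function).
summary = (
    toy.groupby("mnth_yr")
    .agg(
        mean=("analysed_sst", "mean"),
        min=("analysed_sst", "min"),
        max=("analysed_sst", "max"),
    )
    .reset_index()
)
print(summary)
```

The result has flat column names straight away, so there is no multi-level header to overwrite afterwards.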

Creating a plot from a dataframe requires some more code. Here we create a chart that displays the monthly mean temperature for each year, with a red line marking 30.5 degrees

In [None]:
selection = alt.selection_multi(fields=['year'], bind="legend")
chart = alt.Chart(temp_by_mnth_yr_df).mark_line().encode(
            x=alt.X("month", title="Month", sort="ascending"),
            y=alt.Y('mean', title='Monthly mean temperature',scale=alt.Scale(domain=(22,33))),
            color=alt.Color('year', title="Year",
                            scale=alt.Scale(domain=temp_by_mnth_yr_df["year"].unique(), scheme="paired")),
            opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
            # tooltip=[alt.Tooltip('mean', title='Mean Temperature')],
            tooltip=['mean', 'year'],
)


line = alt.Chart(pd.DataFrame({'mean': [30.5]})).mark_rule().encode(y='mean', color=alt.value("#FF0000"),strokeWidth=alt.value(3))

alt.layer(chart, line).configure_view(
    stroke='transparent'
).properties(
    title='Monthly average per year', width=500, height=300
).configure_axis(
    labelFontSize=15,
    titleFontSize=15
).configure_legend(
    labelFontSize=12, columns=1, labelLimit=500, symbolLimit=100
).add_selection(
    selection  # register the legend selection once; adding it twice duplicates it
)

We can also look at a specific year and add a visual check of whether the temperature is above 30.5 degrees - an important threshold for corals

In [None]:
df_plot = temp_by_mnth_yr_df[temp_by_mnth_yr_df.year == '2005']

plt.figure(figsize=(15, 7))

# Monthly mean as a line with markers, plus the 30.5 degree threshold
plt.plot(df_plot['mnth_yr'], df_plot['mean'], 'g')
plt.scatter(df_plot['mnth_yr'], df_plot['mean'], c='g', label='mean temp')
plt.axhline(y=30.5, color='r', linestyle='--', label='30.5 threshold')

plt.xlabel('mnth_yr', size=14)
plt.ylabel(r'temperature ($^\circ C$)', size=14)  # raw string avoids an invalid escape warning
plt.xticks(df_plot['mnth_yr'], rotation=45)
ax = plt.gca()
ax.axis([0, 11, -40, 40])
plt.title('Mean temperature', size=14)
plt.legend(loc=0)  # called after all labeled artists exist

plt.show()
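
The visual check can be backed up with a simple filter that lists the months above the threshold. The sketch below uses made-up monthly means for illustration only; with the real data you would filter `df_plot` in the same way.

```python
import pandas as pd

THRESHOLD_C = 30.5  # threshold used in the plot above

# Made-up monthly means for illustration only; not real SST values.
monthly = pd.DataFrame(
    {
        "mnth_yr": ["2005-06", "2005-07", "2005-08", "2005-09", "2005-10"],
        "mean": [29.1, 30.2, 30.8, 30.9, 29.7],
    }
)

above = monthly[monthly["mean"] > THRESHOLD_C]
months_above = above["mnth_yr"].tolist()
print(months_above)
```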


Since we also have information on the max and min temps, we can create boundaries to better understand the temperature deviation for a month.

In [None]:
plt.figure(figsize=(15,7))

plt.plot(df_plot['mnth_yr'],df_plot['max'], '--r')
plt.plot(df_plot['mnth_yr'],df_plot['min'], '-.b')
plt.plot(df_plot['mnth_yr'],df_plot['mean'], 'g')
plt.scatter(df_plot['mnth_yr'],df_plot['max'],c='r',label='max temp')
plt.scatter(df_plot['mnth_yr'],df_plot['min'],c='b',label='min temp')
plt.scatter(df_plot['mnth_yr'],df_plot['mean'],c='g',label='mean temp')

plt.xlabel('mnth_yr', size=14)
plt.ylabel(r'temperature ($^\circ C$)', size=14)  # raw string avoids an invalid escape warning
plt.xticks(df_plot['mnth_yr'], rotation=45)
# Shade the band between the monthly minimum and maximum
plt.gca().fill_between(df_plot['mnth_yr'],
                       df_plot['max'], df_plot['min'],
                       facecolor='#9D59F4',
                       alpha=0.35)
plt.title('Maximum, minimum and mean temperature',size=14)
plt.legend(loc=0)

plt.show()


#### Challenge
Grouping this giant area together is taking quite some liberties and is not the best way to represent the data. Pick a more granular area and look at the trends. (Below there is some information on specific coral reef sites)


## Now let's try to combine temperature data with coral bleaching data

Let's read the bleaching data

In [None]:
df_bl = pd.read_csv('bleaching_data.csv')
df_bl['longitude'] = df_bl.Longitude_Degrees.round()
df_bl['latitude'] = df_bl.Latitude_Degrees.round()

df_bl.head()

First we want to look at which year contains the most data on bleaching

In [None]:
df_bl[['Date_Year','Bleaching_Level']].groupby('Date_Year').count()
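
Rather than scanning the table by eye, the year with the most samples can also be picked out directly. A minimal sketch on hypothetical survey rows; the column names follow the bleaching table, the values are invented.

```python
import pandas as pd

# Hypothetical survey rows; column names follow the bleaching table above.
toy = pd.DataFrame(
    {
        "Date_Year": [2004, 2005, 2005, 2005, 2006, 2006],
        "Bleaching_Level": [0, 2, 1, 3, 0, 1],
    }
)

counts = toy.groupby("Date_Year")["Bleaching_Level"].count()
print(counts.sort_values(ascending=False))
print("Year with most samples:", counts.idxmax())
```

Note that `count()` only counts non-missing values, so rows without a bleaching level do not contribute to a year's total.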

Geopandas is a powerful library to use when working with data that contains geographical information. Let's convert the coral bleaching dataframe from above to a GeoDataFrame object.
This will create a 'geometry' column that is easy to plot

In [None]:
gdf = gpd.GeoDataFrame(
    df_bl, geometry=gpd.points_from_xy(df_bl.Longitude_Degrees, df_bl.Latitude_Degrees))
gdf.head(5)

### Plot bleaching samples by year

In [None]:
db_plt.plot_points(gdf, col='Date_Year')

### Plot bleaching samples by Bleaching Level

In [None]:
db_plt.plot_points(gdf[gdf['Bleaching_Level'] > 0], col='Bleaching_Level')

#### Challenge

Plot another column that could be an interesting feature to look closer at.

### Timeseries classification

As seen in the plots from the bleaching database, there are areas without any samples.<br>
In this section we are using a pre-trained binary classification model to predict whether
an area has bleached corals, based on the sea surface temperature time series at each location in the area.<br>
Creating and training the model is not part of this workshop, but if you would like to see how it was done and/or further improve it after the workshop,
the code is in a notebook called `timeseries_classification.ipynb`.

In [None]:
#Getting sea surface temperature timeseries for an area without samples in the bleaching database
ds_slice_predict =temp_data.sel(lon=slice(-80,-75), lat=slice(18, 22), time=slice('2000-01-01', '2020-01-05'))

Again, this next step takes a while (around 5 mins), so it might be a good time to grab a coffee, or check out this really [cool Google Earth site about coral bleaching!](https://earth.google.com/web/@24.4430141,123.8161774,-0.51676057a,500d,35y,10.51093386h,0t,0r/data=CkoSSBIgY2EwYzk0ZGNhN2I4MTFlN2I1ZDBiNzRhMWFlNGU2MDMiJGVmZWVkX29jZWFuX2FnZW5jeV9jb3JhbF9ibGVhY2hpbmdfMQ)

In [None]:
%%time
df = ds_slice_predict.to_dataframe().reset_index() 

In [None]:
df.head()

In [None]:
#Get a list of sea surface temperature readings for all locations (lat-lon group)
df['analysed_sst'] = df['analysed_sst'].fillna(0)
df_grp= df.groupby(by=['lat', 'lon']).agg({'analysed_sst':lambda x: list(x)})

In [None]:
#remove locations where all the temperatures are NaN or 0
df_grp['sst_sum'] = df_grp.apply(lambda row: sum(row['analysed_sst']), axis = 1)
df_grp = df_grp.where(df_grp['sst_sum'] > 0).dropna()
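
To see what the two cells above do, here is the same idea on a toy table with two grid cells, one of which has no readings at all (standing in for land). The data is made up, and the filter uses plain boolean indexing instead of `where(...).dropna()`, which should give the same rows.

```python
import numpy as np
import pandas as pd

# Toy long-format table: two grid cells, three time steps each.
# The cell at (20.0, -80.0) stands in for land, where SST is missing.
toy = pd.DataFrame(
    {
        "lat": [20.0] * 3 + [21.0] * 3,
        "lon": [-80.0] * 3 + [-79.0] * 3,
        "analysed_sst": [np.nan, np.nan, np.nan, 28.0, 28.5, 29.0],
    }
)

toy["analysed_sst"] = toy["analysed_sst"].fillna(0)

# One list of readings per (lat, lon) cell
series_per_cell = toy.groupby(["lat", "lon"])["analysed_sst"].agg(list)

# Keep only cells with at least one non-zero reading.
kept = series_per_cell[series_per_cell.apply(sum) > 0]
print(kept)
```

Using 0 as the fill value works here because real sea surface temperatures in this region are well above zero, so a sum of 0 can only come from a cell with no data.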

#### Scale the temperatures
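Input variables should be standardized before they are passed to the model: a `StandardScaler` shifts the data to zero mean and scales it to unit variance (it does not make the data normally distributed). Many ML models train and predict more reliably on inputs scaled this way, and the model expects the same scaling that was used during training.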

In [None]:
#Load the same sklearn StandardScaler as used when training the model
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

In [None]:
def data_scaler(data_list):
    scaled_array = scaler.transform(np.array(data_list).reshape(-1, 1))
    return scaled_array.reshape(scaled_array.shape[0]).tolist()
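
What the scaler does can be written out by hand. The sketch below uses plain NumPy and made-up temperatures; it mirrors the standard-score formula `(x - mean) / std` that a `StandardScaler` applies, but it is not the scaler stored in `scaler.pkl`, whose fitted mean and standard deviation come from the training data.

```python
import numpy as np

# Made-up temperatures standing in for the training data, for illustration.
train_temps = np.array([26.0, 28.0, 30.0, 32.0])

mean = train_temps.mean()
std = train_temps.std()  # population standard deviation (ddof=0)


def standardize(values, mean, std):
    """Shift to zero mean and scale to unit variance using fitted statistics."""
    return (np.asarray(values, dtype=float) - mean) / std


scaled = standardize(train_temps, mean, std)
print(scaled.mean(), scaled.std())

# New data must reuse the fitted mean and std, never its own.
new_scaled = standardize([29.0], mean, std)
print(new_scaled)
```

This is why the notebook loads the saved scaler instead of fitting a new one: fitting on the prediction data would shift the inputs relative to what the model saw during training.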

In [None]:
#Scale all the temperatures in the list
df_grp['sst_scaled'] = df_grp.apply(lambda row: data_scaler(row['analysed_sst']), axis = 1)

#### Import/install tensorflow and load the model

In [None]:
import sys
#!{sys.executable} -m pip install tensorflow
from tensorflow import keras
model = keras.models.load_model("./ts_classification_model.h5")

#### Use model to predict whether corals on location are bleached or not

In [None]:
def data_predict(data_list):
    x_array = np.array(data_list[3000:7000]).reshape(1, 4000, 1)
    y_pred = model.predict(np.array(x_array))
    return np.argmax(y_pred, axis = 1)[0]
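
The reshaping and `argmax` steps in `data_predict` can be tried without the model. In this sketch the input series and the class scores are placeholders: the window of 4000 time steps comes from the cell above, while the two-score output row is an assumption about what `model.predict` returns for a binary classifier.

```python
import numpy as np

WINDOW = 4000  # number of time steps taken from each series, per the cell above

# Placeholder scaled series, longer than the window (values are arbitrary).
series = np.linspace(-1.0, 1.0, 7300).tolist()

x = np.array(series[3000:3000 + WINDOW]).reshape(1, WINDOW, 1)
print(x.shape)  # (batch, time steps, features)

# Stand-in for model.predict output: one row of class scores per batch item.
fake_scores = np.array([[0.2, 0.8]])
predicted_class = int(np.argmax(fake_scores, axis=1)[0])
print(predicted_class)
```

Because the slice is fixed at positions 3000 to 7000, every series passed in needs at least 7000 time steps; a shorter one would make the reshape fail.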

In [None]:
##This step also takes a few minutes
df_grp['bleached_predict'] = df_grp.apply(lambda row: data_predict(row['sst_scaled']), axis = 1) #This takes some time

In [None]:
df_grp_l = df_grp.reset_index()

In [None]:
df_grp_geo = gpd.GeoDataFrame(
    df_grp_l, geometry=gpd.points_from_xy(df_grp_l['lon'], df_grp_l['lat']))

### Plot the predictions on a map

The plot shows locations (between southern Cuba and northern Jamaica) where corals are predicted to be bleached (1, dark color) and not bleached (0, yellow color).
Not all locations have coral reefs; this is not taken into consideration.

In [None]:
db_plt.plot_points(df_grp_geo, col='bleached_predict')

What did you learn with this notebook?<br>
What else could be done with this data?<br>