# Future Shorelines

Exploratory notebook for shoreline prediction using satellite-derived shoreline-position data. It is used to explore and experiment with the data of the future shoreline prediction project. The full project can be found at: https://github.com/florisrc/ShorePred 



## Configure notebook

This notebook is meant to be connected to the ShorePred GitHub repository (https://github.com/florisrc/ShorePred). It is designed with the following workflow in mind: 

1. Mount Colab to drive.
2. Clone the remote GitHub repo to Colab.
3. Copy GitHub repo to drive.
4. Create temp work directory with GitHub files in Colab. 
5. Save nb changes to Colab nb in drive.
6. Clone remote GitHub to temp Colab directory. 
7. Sync changes from drive to temp Colab directory. 
8. Commit changes to remote GitHub directory. 

The following cells set up this framework and provide helper functions. 

Please note that the setup requires a configuration file with GitHub credentials, stored under a `git_config` key: 

``` 
{"git_config": {"repository": "***", "user": "***", "password": "***", "email": "***"}}
```
If buckets are used, the configuration file should also include gcloud credentials under a `gcloud_config` key (`project_id` and `data_bucket`). 


Furthermore, the notebook should be saved manually before running the ```prepare_git_commit()``` and ```git_commit()``` functions if notebook changes should be included in the commit. 
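
The helper cells in this notebook read the credentials with `json.load(f)['git_config']` and the cloud settings with `json.load(f)['gcloud_config']`. The sketch below shows such a nested file and how one section at a time can be loaded. `load_section` is a hypothetical helper that is not part of the project, and all values are placeholders.

```python
import json
import os
import tempfile

# Placeholder values; section and key names follow the helper cells of this notebook.
example_config = {
    "git_config": {"user": "***", "password": "***", "email": "***"},
    "gcloud_config": {"project_id": "***", "data_bucket": "***"},
}

def load_section(config_file, section):
    """Return one section of the JSON config, failing early if it is missing."""
    with open(config_file, "r") as f:
        config = json.load(f)
    if section not in config:
        raise KeyError(f"'{section}' missing from {config_file}")
    return config[section]

# Write the example to a temporary file and read both sections back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump(example_config, f)
    git_config = load_section(path, "git_config")
    gcloud_config = load_section(path, "gcloud_config")

print(sorted(git_config))
print(sorted(gcloud_config))
```

Checking for the section up front gives a clearer error than a bare `KeyError` deep inside a helper when the file only contains one of the two sections.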


## Directory & authentication configurations

Set file names and paths, mount the drive and authenticate to cloud storage. 

In [None]:
from google.colab import drive, auth
from os.path import join

# directory configs
ROOT = '/content/drive'     # default for the drive
PROJ = 'shoreline-forecasting'       # name of project 
CONFIG_FILE = ROOT + '/My Drive/personal/config.json' # path to git configs
PROJECT_PATH = join(ROOT, 'My Drive/' + PROJ)

auth.authenticate_user()        # authenticate user cloud storage account
drive.mount(ROOT)       # mount the drive at /content/drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Configure cloud


In [None]:
import json

with open(CONFIG_FILE, 'r') as f:
  gcloud_config = json.load(f)['gcloud_config']   # load configurations
GS_PROJECT_ID  = gcloud_config['project_id']
GS_DATA_BUCKET = gcloud_config['data_bucket']

!gcloud config set project "{GS_PROJECT_ID}"   # set project

Updated property [core/project].


## Helper functions to set up Colab & GitHub integration. 



In [None]:
import json

def clone_github_repo(config_file, targ_dir='', r = "shoreline-forecasting"):
  """Clone GitHub repository. """
  with open (config_file, 'r') as f:
    git_config = json.load(f)['git_config']
    # r = git_config['repository']
    u = git_config['user']
    p = git_config['password']
    !git clone  https://{u}:{p}@github.com/{u}/{r}.git {targ_dir}

def cp_proj_2_drive():
  """Copy files to drive."""
  !cp -r /content/"{PROJ}"/* "{PROJECT_PATH}"

def prepare_git_commit(*args):
  """Sync GitHub repository with Drive. Please save this notebook first if 
  the changes of this notebook should be included in the commit. """
  %cd /content/
  !mkdir ./temp
  clone_github_repo(CONFIG_FILE, targ_dir='./temp')
  !rsync -av --exclude=data/ --exclude=big_data/ --exclude=report/ "{PROJECT_PATH}"/* ./temp

def git_commit(config_file, commit_m='committed from colab nb', branch='master', commit_f='.'):
  """Commit all changes after save."""
  with open (config_file, 'r') as f:
    git_config = json.load(f)['git_config']
  u  = git_config['user']
  e = git_config['email']
  %cd /content/temp
  !git config --global user.email "{e}"
  !git config --global user.name "{u}" 
  !git add "{commit_f}"
  !git commit -m "{commit_m}"
  !git push origin "{branch}"
  %cd /content
  !rm -rf ./temp

## Dependencies

In [None]:
# GDAL: required by many Python geospatial libraries
!apt install -q gdal-bin python-gdal python3-gdal 
# Install rtree - Geopandas requirement
!apt install -q python3-rtree 
# Install Geopandas (https:// because GitHub no longer serves the git:// protocol)
!pip install -q git+https://github.com/geopandas/geopandas.git
# Install descartes - Geopandas requirement
!pip install -q descartes 
# Install Folium for geographic data visualization
!pip install -q folium
# Install Plotly Express
!pip install -q plotly_express

Reading package lists...
Building dependency tree...
Reading state information...
gdal-bin is already the newest version (2.2.3+dfsg-2).
python-gdal is already the newest version (2.2.3+dfsg-2).
python3-gdal is already the newest version (2.2.3+dfsg-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.
Reading package lists...
Building dependency tree...
Reading state information...
python3-rtree is already the newest version (0.8.3+ds-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.
  Building wheel for geopandas (setup.py) ... done


## Thesis 

In [None]:
%cd /content
!mkdir "{PROJECT_PATH}"  # in case we haven't created it already
!mkdir ./temp
clone_github_repo(CONFIG_FILE, targ_dir='temp') # clone git repo using repo config file 
!cp -r ./temp/* "{PROJECT_PATH}"
!rm -rf ./temp
!mkdir "{PROJ}"
!rsync -av --exclude=.idea/ "{PROJECT_PATH}"/* "{PROJ}"

## Load data

Here it is assumed that pre-processed data is available in the following format: unnested time-series dataframe, compressed meta-data and dataframe with outliers. See ```clean_data_example.ipynb``` for an example. 

In [None]:
%cd {PROJ}
!mkdir utils
!cp src/processing/logger.py utils/

/content/shoreline-forecasting
mkdir: cannot create directory ‘utils’: File exists


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

metadata = pd.read_pickle(f'{ROOT}/My Drive/data/sds_compressed.pkl')
data = pd.read_pickle(f'{ROOT}/My Drive/data/tsdf.pkl')
outliers = pd.read_pickle(f'{ROOT}/My Drive/data/df_outliers.pkl')

## Filter
Drop transects without a lon/lat intersect (empty geometry), as they would break the run later. 

In [None]:
import geopandas as gpd

def df2gdf(df):
  """Wrap a dataframe with a 'geometry' column in a GeoDataFrame (WGS 84)."""
  # an authority string avoids the deprecated {"init": ...} dictionary syntax
  return gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=df['geometry'])

gdf = df2gdf(metadata)
# keep only transects whose geometry is not empty
idx = gdf.loc[~gdf['geometry'].is_empty]['transect_id'].to_list()
metadata = metadata.loc[metadata['transect_id'].isin(idx)]




## Set configurations

In [None]:
def get_configs():
  
  configs = {}

  configs['default'] = {
      "drop_outliers_1": True,
      "drop_outliers_2": True,
      "sample": 5000,
      "show_plot": True,
      "save_results": True,
      "fname": "sample",
      "log_stats": False
    }
  configs['metadata_filter'] = {
    "flag_sandy": True,
    "changerate_unc": None,
    "no_sedcomp": True,
    "low_detect_shlines": True,
    "err_changerate": True,
    "err_timespan": True
    }
  configs['nan_filter'] = {
      "nans_year_lt": 1,
      "nans_transect_lt": .15      
    }
  return configs


## Helper functions to filter time series

In [None]:
from datetime import timedelta, datetime
from tqdm.auto import tqdm

tqdm.pandas()

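def drop_indices(s, outlier_dict):
  """Drop the rows of group `s` at the positions listed for its transect."""
  idx = outlier_dict[s.name]
  mask = np.ones(len(s), dtype=bool)
  mask[idx] = False
  return s[mask]

def partial2date(number, reference_year=1984):
  """Convert fractional years since `reference_year` to a datetime (365-day years)."""
  year = reference_year + int(number)
  d = timedelta(days=(reference_year + number - year)*365)
  day_one = datetime(year, 1, 1)
  date = d + day_one
  return date

def partials2dates(list_of_part_dates):
  """Convert a list of fractional years to datetimes."""
  return [partial2date(idx) for idx in list_of_part_dates]

def date2partial(date, reference_year=1984):
  """Convert a datetime to fractional years since `reference_year`."""
  year = date.year
  tt = date.timetuple()
  d = tt.tm_yday/365
  return (year-reference_year) + d
  
def dates2partials(list_of_dates):
  """Convert a list of datetimes to fractional years."""
  return [date2partial(date) for date in list_of_dates]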
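
The two date helpers are approximate inverses of each other. The sketch below repeats them in compact form so that it runs on its own and converts one offset there and back. The offset `10.5` is an arbitrary example value.

```python
from datetime import datetime, timedelta

REFERENCE_YEAR = 1984  # same default as the helpers in this notebook

def partial2date(number, reference_year=REFERENCE_YEAR):
    """Fractional years since the reference year -> datetime (365-day years)."""
    year = reference_year + int(number)
    fraction = reference_year + number - year
    return datetime(year, 1, 1) + timedelta(days=fraction * 365)

def date2partial(date, reference_year=REFERENCE_YEAR):
    """Datetime -> fractional years since the reference year."""
    return (date.year - reference_year) + date.timetuple().tm_yday / 365

offset = 10.5                       # arbitrary example: 10.5 years after 1984
as_date = partial2date(offset)
round_trip = date2partial(as_date)
error_days = abs(round_trip - offset) * 365

print(as_date, round_trip, error_days)
```

Because `tm_yday` is a whole, 1-based day number, the round trip can be off by up to one day. That is negligible for yearly averages, but worth keeping in mind if the helpers are reused at finer resolution.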


## Class to log statistics


In [None]:
from utils.logger import DataFrameLogger

## Filter with metadata

In [None]:
def get_metadata_filter(df, configs):
  """Return the ids of the transects that pass the metadata filters in configs.

  Uses the global `logger` and `data` when `log_stats` is enabled.
  """
  metadata_configs = configs['metadata_filter']
  log_stats = configs['default']['log_stats']
  
  print(f"Transects original df: {len(df['transect_id'].unique())}")
  if log_stats is True:
    logger.get_all_stats_metadata(df, data, label='raw_data')

  # loop through configs and filter frame; one rule applies per key
  for k, v in metadata_configs.items():
    if k == 'flag_sandy':
      if v is True:
        df = df.loc[df['flag_sandy']==True]
    elif k == 'changerate_unc':
      if v is not None:
        df = df.loc[df['changerate_unc']<v]
    elif v is True:
      df = df.loc[df[k]!=0]
    if log_stats is True:
      logger.get_all_stats_metadata(df, data, label=k)

  print(f"Transects filtered df: {len(df['transect_id'].unique())}")
  return df['transect_id'].unique()
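
The metadata filter applies three kinds of rules: a boolean flag (`flag_sandy`), an upper bound (`changerate_unc`) and generic non-zero flags (all other keys set to `True`). The plain-Python sketch below applies the same rules to a few made-up records, so that the effect of each rule is visible without the project data. `select_transects` and the record values are hypothetical.

```python
def select_transects(records, rules):
    """Return the ids of the records that pass every active rule."""
    kept = list(records)
    for key, value in rules.items():
        if key == "flag_sandy":
            if value is True:
                kept = [r for r in kept if r["flag_sandy"] is True]
        elif key == "changerate_unc":
            if value is not None:              # None disables the upper bound
                kept = [r for r in kept if r["changerate_unc"] < value]
        elif value is True:                    # generic flag: keep non-zero entries
            kept = [r for r in kept if r[key] != 0]
    return [r["transect_id"] for r in kept]

# Made-up records for illustration only.
records = [
    {"transect_id": "t1", "flag_sandy": True,  "changerate_unc": 0.2, "no_sedcomp": 1},
    {"transect_id": "t2", "flag_sandy": False, "changerate_unc": 0.1, "no_sedcomp": 1},
    {"transect_id": "t3", "flag_sandy": True,  "changerate_unc": 0.9, "no_sedcomp": 1},
    {"transect_id": "t4", "flag_sandy": True,  "changerate_unc": 0.3, "no_sedcomp": 0},
]

strict = select_transects(
    records, {"flag_sandy": True, "changerate_unc": 0.5, "no_sedcomp": True})
no_bound = select_transects(
    records, {"flag_sandy": True, "changerate_unc": None, "no_sedcomp": True})

print(strict, no_bound)
```

Setting `changerate_unc` to `None`, as in the configurations of this notebook, switches the uncertainty bound off while the other rules stay active.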

## Filter time series

In [None]:
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas()

def filter_data(df, configs, filt=None):
  """Filter the unnested time series by metadata filter, sample and outliers.

  Uses the global `outliers` dataframe for the outlier positions.
  """
  # parse configs
  drop_outliers_1 = configs['default']['drop_outliers_1']
  drop_outliers_2 = configs['default']['drop_outliers_2']
  sample = configs['default']['sample']

  # format df
  df = df.set_index(['transect_id'])

  # filter according to metadata filter (skipped when no filter is given)
  if filt is not None and len(filt) > 0:
    df = df[df.index.isin(filt)]
  
  # optionally take a sample of transects, drawn without replacement
  if sample is not None:
    ids = df.index.unique()
    keep = np.random.choice(ids, size=min(sample, len(ids)), replace=False)
    df = df[df.index.isin(keep)]

  # handle outliers
  df = df.reset_index()
  outliers1 = outliers.set_index('transect_id')['outliers_1_as_int'].to_dict()
  outliers2 = outliers.set_index('transect_id')['outliers_2_as_int'].to_dict()
  if drop_outliers_1 is True:
    print('Dropping outliers 1...')
    df = df.groupby('transect_id').progress_apply(lambda x: drop_indices(x, outliers1))
    df = df.droplevel('transect_id')
  if drop_outliers_2 is True:
    print('Dropping outliers 2...')
    df = df.groupby('transect_id').progress_apply(lambda x: drop_indices(x, outliers2))
    df = df.droplevel('transect_id')

  return df

def df2tsdf(df, configs):
  """Pivot the long frame to one column per transect, indexed by date and offset."""
  log_stats = configs['default']['log_stats']

  df = df.reset_index() # reset index
  df = df.pivot(index='dt', columns='transect_id', values='dist')   # pivot 
  df = df.reset_index()
  df['ts'] = df['dt'].progress_apply(partial2date)
  df = df.set_index(['ts', 'dt'])

  if log_stats is True:
    logger.get_stats_tsdf(df, label='df2tsdf')
  return df
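
The outlier step in `filter_data` removes observations by their position within each transect. The plain-Python sketch below shows the same idea on made-up values. `drop_positions` is a hypothetical stand-in for `drop_indices`, and it is assumed that the stored positions are zero-based.

```python
def drop_positions(values, positions):
    """Return the values whose position is not listed in `positions`."""
    drop = set(positions)
    return [v for i, v in enumerate(values) if i not in drop]

# Made-up series and outlier positions for two transects.
series_by_transect = {"t1": [5.0, 5.2, 40.0, 5.1], "t2": [1.0, 1.1]}
outlier_positions = {"t1": [2], "t2": []}

cleaned = {
    transect: drop_positions(values, outlier_positions[transect])
    for transect, values in series_by_transect.items()
}
print(cleaned)
```

Positions only stay valid as long as the series is unchanged. When two outlier passes run one after the other, the second set of positions has to refer to the series as it is after the first pass.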

## NaN Filter

In [None]:
import matplotlib.pyplot as plt


def nan_filter(df, configs):
  
  # parse configs 
  nans_year_lt = configs['nan_filter']['nans_year_lt']
  nans_transect_lt = configs['nan_filter']['nans_transect_lt']
  show_plot = configs['default']['show_plot']
  log_stats = configs['default']['log_stats']

  #  yearly averages
  yearly = df.groupby(df.index.get_level_values('ts').year).mean()
  filtered1 = yearly[yearly.isnull().mean(axis=1) < nans_year_lt]
  filtered2 = filtered1.loc[:, filtered1.isnull().mean() < nans_transect_lt]

  if log_stats is True:
    labels = ["Before NaN filter",
              f"Filter 1: NaN's per year < {nans_year_lt * 100} %",
              f"Filter 2: NaN's per transect < {nans_transect_lt * 100} %"]
    for i, j in zip([yearly, filtered1, filtered2], labels):
      logger.get_stats_tsdf(i, j)


  keep_years = filtered1.index
  keep_transects = filtered2.columns
  keep_rows = df.index.get_level_values('ts').year.isin(keep_years)
      
  if show_plot is True:
    
    # function to calculate nan proportion
    f = lambda x: x.isnull().mean(axis=1)*100
    
    # dataframe to save results
    res = pd.DataFrame(columns=['ts', 'type', 'p_nans'])
    res = res.set_index('ts')
    
    # calculate proportions
    label  = 0
    for i in [yearly, filtered1, filtered2]:
      temp = pd.DataFrame(f(i), columns=['p_nans'])
      temp['type'] = label
      label += 1
      res = pd.concat([res, temp])

    # plot results
    fig, ax = plt.subplots(figsize=(16, 8))
    res = res.pivot(columns='type', values='p_nans')
    res.plot(kind="bar", stacked=False, ax=ax, rot=45, width=.7)

    # format graph
    ax.set_title("Filtering dataset by proportion of NaN's")
    ax.set_ylabel("Proportion NaN's per transect (%)")
    ax.set_xlabel("Time (yrs)")   
    l1 = ax.legend(['Original data', 
                    f"Filter 1: NaN's per year < {nans_year_lt * 100} %", 
                    f"Filter 2: NaN's per transect < {nans_transect_lt * 100} %"], title='Data selection')
    l2 = ax.legend([f"Transects: {len(yearly.columns)}; years: {len(yearly.index)}; NaN's: {yearly.isna().values.sum()}",
                    f"Transects: {len(filtered1.columns)}; years: {len(filtered1.index)}; NaN's: {filtered1.isna().values.sum()}",
                    f"Transects: {len(filtered2.columns)}; years: {len(filtered2.index)}; NaN's: {filtered2.isna().values.sum()}"],
                    loc="center right", title='Statistics')
    plt.gca().add_artist(l1)
    plt.gca().add_artist(l2)
    plt.show()

  if show_plot is True:

    # function to calculate observations per transect
    f = lambda x: np.count_nonzero(~np.isnan(x))
    
    # dataframe to save results
    res = pd.DataFrame(columns=['idx', 'type', 'p_n_values'])
    res = res.set_index('idx')
    
    # calculate proportions
    label  = 0
    for i in [yearly, filtered1, filtered2]:
      temp = i.apply(f).value_counts(normalize=True)
      temp = pd.DataFrame(temp, columns=['p_n_values'])
      temp['type'] = label
      label += 1
      res = pd.concat([res, temp])

    # plot results
    fig, ax = plt.subplots(figsize=(16, 8))
    res = res.pivot(columns='type', values='p_n_values')
    res.plot(kind="bar", stacked=False, ax=ax, rot=45, width=.7)
    

    # format graph
    ax.set_title("Observations per transect (proportional)")
    ax.set_ylabel("Number of transects (%)")
    ax.set_xlabel("Number of observations")
    ax.legend(['Original data', 
               f"Filter 1: NaN's per year < {nans_year_lt * 100} %", 
               f"Filter 2: NaN's per transect < {nans_transect_lt * 100} %"], title='Data selection',
              loc="upper left")   
    plt.show()


  return df[keep_rows][keep_transects], keep_years, keep_transects
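
The NaN filter works in two stages: first it keeps the years whose share of missing values is below `nans_year_lt`, then, within those years, it keeps the transects whose share is below `nans_transect_lt`. The plain-Python sketch below reproduces the two stages on a small made-up table. The function names, the table and the transect threshold of `0.5` are illustrative assumptions.

```python
import math

def nan_fraction(values):
    """Share of NaN entries in a list (an empty list counts as fully missing)."""
    if not values:
        return 1.0
    return sum(math.isnan(v) for v in values) / len(values)

def two_stage_nan_filter(table, transects, nans_year_lt, nans_transect_lt):
    """Return the years and transects that survive both NaN thresholds."""
    # Stage 1: keep years (rows) with a NaN share below nans_year_lt.
    years = [y for y, row in table.items() if nan_fraction(row) < nans_year_lt]
    # Stage 2: within those years, keep transects (columns) below nans_transect_lt.
    kept = [
        name for j, name in enumerate(transects)
        if nan_fraction([table[y][j] for y in years]) < nans_transect_lt
    ]
    return years, kept

nan = math.nan
transects = ["a", "b", "c"]
table = {                      # yearly means per transect, made-up values
    1984: [nan, nan, nan],
    1985: [1.0, nan, 2.0],
    1986: [1.5, nan, 2.5],
    1987: [2.0, 3.0, nan],
}

keep_years, keep_transects = two_stage_nan_filter(
    table, transects, nans_year_lt=1, nans_transect_lt=0.5)
print(keep_years, keep_transects)
```

With `nans_year_lt` set to 1, as in the configurations of this notebook, only years without a single observation are dropped, so the transect threshold does most of the filtering.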

In [None]:
def handle_nans(df, configs):
  """Filter by NaN share, average per year and interpolate the remaining gaps."""

  # filter nans
  df, keep_years, keep_transects = nan_filter(df, configs)

  # interpolate
  df = df.groupby(df.index.get_level_values('ts').year).mean()
  df = df.interpolate(method='linear', limit_direction='both', axis=0)
   
  return df, keep_years, keep_transects
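
After filtering, the remaining gaps in the yearly series are filled by linear interpolation in both directions. The plain-Python sketch below approximates that step for a single series, assuming that gaps at either end are filled with the nearest valid value. It is meant to illustrate the idea and does not replace the pandas call. `interpolate_linear` is a hypothetical helper.

```python
import math

def interpolate_linear(values):
    """Fill NaN gaps linearly; leading and trailing gaps take the nearest valid value."""
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not valid:
        return list(values)            # nothing to interpolate from
    out = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = values[right]     # leading gap
        elif right is None:
            out[i] = values[left]      # trailing gap
        else:
            span = values[right] - values[left]
            out[i] = values[left] + span * (i - left) / (right - left)
    return out

series = [math.nan, 10.0, math.nan, math.nan, 16.0, math.nan]
filled = interpolate_linear(series)
print(filled)
```

Interpolating yearly means instead of the raw observations keeps the spacing between points regular, which is what makes a purely positional linear fill reasonable here.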

## Save results in one pickle object

In [None]:
from datetime import datetime
import pickle

def save_results(configs, tsdf):
  # get configs
  fname = configs['default']['fname']

  # assign unique name and make globally available
  timestamp = int(datetime.timestamp(datetime.now()))
  global filename
  filename = f"{fname}_{timestamp}.pkl"
  fpath = f'{ROOT}/My Drive/data/{filename}'
  print(f"Saving results as: {fpath}")

  # bundle results in one dictionary
  res = {}
  res['configs'] = configs
  res['logger'] = logger.res
  res['metadata'] = metadata.loc[metadata['transect_id'].isin(tsdf.columns)]
  res['tsdf'] = tsdf
  
  # save bundled results in drive 
  with open(fpath, 'wb') as handle:
    pickle.dump(res, handle, protocol=pickle.HIGHEST_PROTOCOL)
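
The results are bundled in one dictionary and pickled under a timestamped file name. The sketch below shows the same round trip against a temporary directory instead of Drive. `save_bundle` and `load_bundle` are hypothetical helpers, and the bundle content is a placeholder.

```python
import os
import pickle
import tempfile
from datetime import datetime

def save_bundle(bundle, directory, fname="sample"):
    """Pickle a results dictionary under a timestamped name and return its path."""
    timestamp = int(datetime.now().timestamp())
    fpath = os.path.join(directory, f"{fname}_{timestamp}.pkl")
    with open(fpath, "wb") as handle:
        pickle.dump(bundle, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return fpath

def load_bundle(fpath):
    """Load a pickled results dictionary."""
    with open(fpath, "rb") as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder content; the notebook stores configs, logger, metadata and tsdf.
    bundle = {"configs": {"default": {"fname": "sample"}}, "tsdf": [[1.0, 2.0]]}
    path = save_bundle(bundle, tmp)
    restored = load_bundle(path)

print(os.path.basename(path))
```

Returning the path from the save function is one way to avoid a global `filename` variable: the caller can pass the path straight to the loading step.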

## Update selection


In [None]:
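def update_data_selection(data, metadata, outliers, configs):
  """Run the full data selection and return the cleaned time-series dataframe.

  Steps: build the metadata filter, filter the time series, pivot to a
  time-series dataframe, handle NaNs and optionally save the results.
  """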
  # parse configs
  save2drive = configs['default']['save_results']

  # generate filter with metadata 
  metadata_filter = get_metadata_filter(metadata, configs)
  # filter data according to configs and metadata
  data = filter_data(data, configs, filt=metadata_filter)
  # transform to tsdf (pivot)
  tsdf = df2tsdf(data, configs)
  # deal with nans according to config settings
  tsdf,_,_ = handle_nans(tsdf, configs)
  # save results bundled in drive 
  if save2drive is True:
    save_results(configs, tsdf)
  return tsdf


## Run data selection


In [None]:
def get_configs():
  
  configs = {}

  configs['default'] = {
      "drop_outliers_1": True,
      "drop_outliers_2": True,
      "sample": None,
      "show_plot": True,
      "save_results": True,
      "fname": "sample",
      "log_stats": False
    }
  configs['metadata_filter'] = {
    "flag_sandy": True,
    "changerate_unc": None,
    "no_sedcomp": True,
    "low_detect_shlines": True,
    "err_changerate": True,
    "err_timespan": True
    }
  configs['nan_filter'] = {
      "nans_year_lt": 1,
      "nans_transect_lt": .25      
    }
  return configs

  
configs = get_configs()
logger = DataFrameLogger()
clean_data = update_data_selection(data, metadata, outliers, configs)




Transects original df: 1780547
Transects filtered df: 635551
Dropping outliers 1...



ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-70a6c0771c69>", line 31, in <module>
    clean_data = update_data_selection(data, metadata, outliers, configs)
  File "<ipython-input-16-9ae6c3843822>", line 13, in update_data_selection
    data = filter_data(data, configs, filt=metadata_filter)
  File "<ipython-input-12-4d8380dd1c8c>", line 32, in filter_data
    df = df.groupby('transect_id').progress_apply(lambda x: drop_indices(x, outliers1))
  File "/usr/local/lib/python3.6/dist-packages/tqdm/std.py", line 753, in inner
    result = getattr(df, df_function)(wrapper, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py", line 736, in apply
    result = self._python_apply_general(f)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py", line 755, in _python

KeyboardInterrupt: ignored

## Load results

In [None]:
def load_results(fname=None):
  """Load results from pickle. The globally set filename is the default."""
  if fname is None:
    fname = filename   # resolved at call time, so the latest saved file is used
  fpath = f"{ROOT}/My Drive/data/{fname}"

  with open(fpath, 'rb') as handle:
    res = pickle.load(handle)
  return res

results = load_results(fname=filename)



In [None]:
filename 

'sample_1593166860.pkl'

In [None]:
logger2 = results['logger']
metadata2 = results['metadata']
tsdf2 = results['tsdf']

logger2.res 


| idx | operation | transects | nans | p_nans |
|---|---|---|---|---|


## To be continued..

## Commit changes to GitHub

In [None]:
prepare_git_commit()
git_commit(CONFIG_FILE, commit_m='Fixed pickle error (empty gpd).')