# Future Shorelines

Exploratory notebook for shoreline prediction using satellite-derived shoreline-position data. It is used to explore and experiment with the data of the future shoreline prediction project. The full project can be found at: https://github.com/florisrc/ShorePred



## Configure notebook

This notebook is meant to be connected to the ShorePred GitHub repository (https://github.com/florisrc/ShorePred). It is designed with the following workflow in mind:

1. Mount Colab to Drive.
2. Clone the remote GitHub repo to Colab.
3. Copy the GitHub repo to Drive.
4. Create a temp work directory with the GitHub files in Colab.
5. Save notebook changes to the Colab notebook in Drive.
6. Clone the remote GitHub repo to a temp Colab directory.
7. Sync changes from Drive to the temp Colab directory.
8. Commit changes to the remote GitHub repository.

In the following few cells this framework is set up, while helper functions are provided. 

Please note that this requires a configuration file that includes GitHub credentials:

``` 
{"repository": "***", "user": "***", "password": "***", "email": "***"}
```
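The configuration file should also include gcloud credentials if buckets are used. Note that the cells below read the Git settings from a `git_config` key and the cloud settings (`project_id`, `data_bucket`) from a `gcloud_config` key.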
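
The cells further down load the file as nested JSON: the Git helpers look up `git_config` (`user`, `password`, `email`) and the cloud cell looks up `gcloud_config` (`project_id`, `data_bucket`). A minimal sketch of a sanity check for that layout follows; the name `check_config` and the placeholder values are illustrative and not part of the project.

```python
import json

# Keys the notebook cells read from each section of the config file.
REQUIRED_KEYS = {
    "git_config": ["user", "password", "email"],
    "gcloud_config": ["project_id", "data_bucket"],
}

def check_config(config):
    """Return a list of 'section.key' entries missing from the config dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section, {})
        missing.extend(f"{section}.{key}" for key in keys if key not in values)
    return missing

# Placeholder values only; real credentials belong in the private config file.
sample = json.loads(
    '{"git_config": {"user": "***", "password": "***", "email": "***"},'
    ' "gcloud_config": {"project_id": "***"}}'
)
print(check_config(sample))  # lists the entries that still need to be filled in
```

Running such a check before mounting and cloning makes a missing key show up as a readable message instead of a `KeyError` halfway through a cell.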


Furthermore, the notebook should be saved manually before running the ```prepare_git_commit()``` and ```git_commit()``` functions if notebook changes should be included in the commit.


## Directory & authentication configurations

Set file names and paths, mount the drive, and authenticate to cloud storage.

In [0]:
from google.colab import drive, auth
from os.path import join

# directory configs
ROOT = '/content/drive'     # default for the drive
PROJ = 'ds-thesis'       # name of project 
CONFIG_FILE = ROOT + '/My Drive/personal/config.json' # path to git configs
PROJECT_PATH = join(ROOT, 'My Drive/' + PROJ)

auth.authenticate_user()        # authenticate user cloud storage account
drive.mount(ROOT)       # mount the drive at /content/drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Configure cloud


In [0]:
import json

with open(CONFIG_FILE, 'r') as f:
  gcloud_config = json.load(f)['gcloud_config']   # load configurations
GS_PROJECT_ID  = gcloud_config['project_id']
GS_DATA_BUCKET = gcloud_config['data_bucket']

!gcloud config set project "{GS_PROJECT_ID}"   # set project

Updated property [core/project].


## Helper functions to set up Colab & GitHub integration. 



In [0]:
import json

def clone_github_repo(config_file, targ_dir='', r="ShorePred"):
  """Clone the GitHub repository `r` of the configured user into `targ_dir`."""
  with open(config_file, 'r') as f:
    git_config = json.load(f)['git_config']
  # r = git_config['repository']
  u = git_config['user']
  p = git_config['password']
  !git clone https://{u}:{p}@github.com/{u}/{r}.git {targ_dir}

def cp_proj_2_drive():
  """Copy files to drive."""
  !cp -r /content/"{PROJ}"/* "{PROJECT_PATH}"

def prepare_git_commit(*args):
  """Sync GitHub repository with Drive. Please save this notebook first if 
  the changes of this notebook should be included in the commit. """
  %cd /content/
  !mkdir ./temp
  clone_github_repo(CONFIG_FILE, targ_dir='./temp')
  !rsync -av --exclude=data/ --exclude=big_data/ --exclude=report/ "{PROJECT_PATH}"/* ./temp

def git_commit(config_file, commit_m='committed from colab nb', branch='master', commit_f='.'):
  """Commit and push all changes after saving."""
  with open(config_file, 'r') as f:
    git_config = json.load(f)['git_config']
  u  = git_config['user']
  e = git_config['email']
  %cd /content/temp
  !git config --global user.email "{e}"
  !git config --global user.name "{u}" 
  !git add "{commit_f}"
  !git commit -m "{commit_m}"
  !git push origin "{branch}"
  %cd /content
  !rm -rf ./temp

## Clone GitHub repository

In [0]:
%cd /content
!mkdir "{PROJECT_PATH}"  # in case we haven't created it already
!mkdir ./temp
clone_github_repo(CONFIG_FILE, targ_dir='temp') # clone git repo using repo config file 
!cp -r ./temp/* "{PROJECT_PATH}"
!rm -rf ./temp
!mkdir "{PROJ}"
!rsync -av --exclude=.idea/ "{PROJECT_PATH}"/* "{PROJ}"

/content
mkdir: cannot create directory ‘/content/drive/My Drive/ds-thesis’: File exists
Cloning into 'temp'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 785 (delta 25), reused 29 (delta 13), pack-reused 735
Receiving objects: 100% (785/785), 64.49 MiB | 34.34 MiB/s, done.
Resolving deltas: 100% (443/443), done.
mkdir: cannot create directory ‘ds-thesis’: File exists
sending incremental file list
README.md
application.py
classic_models.ipynb
clean_data_final.ipynb
clean_sds.ipynb
colab_nb.ipynb
environment.yml
es_rnn_colab_nb_example.ipynb
explore_es_rnn.ipynb
explore_models.ipynb
explore_sds.ipynb
explore_sds2.ipynb
main.py
n-beats.ipynb
nb_exploration.ipynb
nb_exploration2.ipynb
nb_update_w4.ipynb
requirements.txt
big_data/report/
big_data/report/ACM-Reference-Format.bbx
big_data/report/ACM-Reference-Format.bst
big_data/report/ACM-Reference-Format.cbx
big_data/report/

## Load raw data

In [0]:
# choose to copy all or individual file
%cd /content/"{PROJ}"
!mkdir gcloud_data
!gsutil -m cp -r gs://"{GS_DATA_BUCKET}"/* /content/"{PROJ}"/gcloud_data

/content/ds-thesis
mkdir: cannot create directory ‘gcloud_data’: File exists
Copying gs://future-shorelines-data/sample-data-clean-500m.csv...
Copying gs://future-shorelines-data/sds.csv...
/ [2/2 files][  2.0 GiB/  2.0 GiB] 100% Done  69.5 MiB/s ETA 00:00:00           
Operation completed over 2 objects/2.0 GiB.                                      


In [0]:
import pandas as pd
df_raw_init = pd.read_csv(f'/content/{PROJ}/gcloud_data/sds.csv')
df_raw = df_raw_init.copy()

## Demonstrate initial cleaning
Cleaning of the dataframe is demonstrated here on a sample, because processing the whole dataset requires a very large amount of RAM. The dataset has to be exploded (unnested), which takes it from 1.8 million rows to about 50 million rows.

In [0]:
example = df_raw.sample(1000)

## Optimize dataframe memory 

The dataframe consists of 1.8 million rows, most of which contain nested lists that have to be exploded. To avoid out-of-memory errors we optimize the dtypes of the structure that holds the data.

In [0]:
from typing import List, Optional

def optimize_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 columns to the smallest float dtype that fits."""
    floats = df.select_dtypes(include=['float64']).columns.tolist()
    df[floats] = df[floats].apply(pd.to_numeric, downcast='float')
    return df


def optimize_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast int64 columns to the smallest integer dtype that fits."""
    ints = df.select_dtypes(include=['int64']).columns.tolist()
    df[ints] = df[ints].apply(pd.to_numeric, downcast='integer')
    return df


def optimize_objects(df: pd.DataFrame, ignore_features: List[str]) -> pd.DataFrame:
    """Convert object columns with fewer than 50% unique values to category."""
    for col in df.select_dtypes(include=['object']):
        if col not in ignore_features:
            num_unique_values = len(df[col].unique())
            num_total_values = len(df[col])
            if float(num_unique_values) / num_total_values < 0.5:
                df[col] = df[col].astype('category')
    return df

def optimize(df: pd.DataFrame, ignore_features: Optional[List[str]] = None) -> pd.DataFrame:
    """Apply all dtype optimizations; columns in ignore_features are left as objects."""
    ignore_features = ignore_features or []   # avoid a mutable default argument
    return optimize_floats(optimize_ints(optimize_objects(df, ignore_features)))

example = optimize(example, ['dt', 'dt2', 'dist', 'dist2', 'outliers_1', 'outliers_2'])
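
To see what the downcasting buys, memory can be compared before and after on a small frame. The sketch below uses a toy dataframe (the columns and values are made up for illustration) and applies the same three conversions as the `optimize_*` helpers.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the shoreline table (values are illustrative).
toy = pd.DataFrame({
    "changerate": np.linspace(-1.0, 1.0, 1000),       # float64
    "no_shorelines": np.arange(1000, dtype="int64"),  # int64
    "continent": ["Europe", "Asia"] * 500,            # low-cardinality strings
})

before = toy.memory_usage(deep=True).sum()

# The same three conversions the optimize_* helpers apply.
toy["changerate"] = pd.to_numeric(toy["changerate"], downcast="float")
toy["no_shorelines"] = pd.to_numeric(toy["no_shorelines"], downcast="integer")
toy["continent"] = toy["continent"].astype("category")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Note that `downcast='float'` yields `float32`, which keeps roughly seven significant digits; whether that is enough for the coordinate columns is a judgment call for the project.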

## Tokenize 

The shoreline position data is held in strings that each contain a list. We therefore first split each string into tokens and convert the tokens to floats. To avoid duplication of data, from here on we distinguish between the time series data and their metadata.

In [0]:
from tqdm.auto import tqdm

tqdm.pandas()

def tokenize(string_of_list):
  return string_of_list[1:-1].split(', ')

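def str2flt(string_of_list):
  """Parse a string such as '[1.0, 2.5]' into a list of floats."""
  try:
    return [float(x) for x in string_of_list[1:-1].split(', ')]
  except (TypeError, ValueError):   # missing value or non-numeric token
    return 'NotConverted'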

def create_tokenized_tsdf(df):
  df['dt'] = df['dt'].progress_apply(str2flt)
  df['dist'] = df['dist'].progress_apply(str2flt)
  return df

example = create_tokenized_tsdf(example)
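
As an alternative to slicing and splitting the string by hand, the standard library can do the parsing. The sketch below assumes the raw strings are valid JSON lists of numbers, which has not been verified against `sds.csv`; `parse_float_list` is a hypothetical helper, and it returns `None` instead of the `'NotConverted'` sentinel so that missing rows could be dropped with `notna()`.

```python
import json
import math

def parse_float_list(raw):
    """Parse a string such as '[1.5, 2.0]' into a list of floats.

    Returns None when the value is missing or is not a non-empty flat list of numbers.
    """
    if not isinstance(raw, str):
        return None                       # e.g. NaN for a transect without observations
    try:
        values = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(values, list) or not values:
        return None
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return None
    return [float(v) for v in values]

print(parse_float_list("[2.00141, 8.000164]"))  # a regular entry
print(parse_float_list(math.nan))               # a missing entry
print(parse_float_list("[]"))                   # an empty list
```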



## Filter 

1. First we filter out transects without observations; these can be identified with ```df['dt'] == 'NotConverted'```;
2. Then we mask all non-sandy transects;
3. Finally we mask all transects with ```changerate_unc >= 0.5```.

In [0]:
print(f"Transects original df: {len(example['transect_id'].unique())}")
example = example[(example['dt']!= 'NotConverted') & (example['dist']!='NotConverted')]
print(f"Transects with observations: {len(example['transect_id'].unique())}")
example = example.loc[example['flag_sandy']==True]    # keep only sandy transects
print(f"Transects flag sandy df: {len(example['transect_id'].unique())}")
example = example.loc[example['changerate_unc']<0.5]    # keep transects with relatively constant trends
print(f"Transects changerate_unc < 0.5 df : {len(example['transect_id'].unique())}")

Transects original df: 1000
Transects with observations: 965
Transects flag sandy df: 352
Transects changerate_unc < 0.5 df : 237
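
The same three steps can also be written as named boolean masks, which makes the counts at each stage easy to report without reassigning the dataframe. This is a sketch on a made-up four-row frame; checking for parsed lists with `isinstance` is a substitute for the string comparison used above, not the notebook's own method.

```python
import pandas as pd

# Toy frame with one row per transect (values are illustrative).
toy = pd.DataFrame({
    "transect_id": ["a", "b", "c", "d"],
    "dt": [[1.0, 2.0], "NotConverted", [3.0], [4.0, 5.0]],
    "dist": [[10.0, 11.0], "NotConverted", [12.0], [13.0, 14.0]],
    "flag_sandy": [True, True, False, True],
    "changerate_unc": [0.2, 0.1, 0.3, 0.9],
})

# A row has observations when both columns hold a parsed list.
has_obs = toy["dt"].map(lambda v: isinstance(v, list)) & toy["dist"].map(
    lambda v: isinstance(v, list)
)
is_sandy = toy["flag_sandy"]
low_unc = toy["changerate_unc"] < 0.5

# Cumulative counts, one line per filter step.
print("original:", len(toy))
print("with observations:", int(has_obs.sum()))
print("and sandy:", int((has_obs & is_sandy).sum()))
print("and changerate_unc < 0.5:", int((has_obs & is_sandy & low_unc).sum()))

filtered = toy[has_obs & is_sandy & low_unc]
```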


## Explode (unnest) shoreline positions



In [0]:
import numpy as np

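def unnesting(df, explode):
  """Unnest the list columns in `explode`, repeating all other columns per element."""
  idx = df.index.repeat(df[explode[0]].str.len())
  df1 = pd.concat([
      pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
  df1.index = idx

  # use the keyword form; the positional axis argument was removed in pandas 2.0
  return df1.join(df.drop(columns=explode), how='left')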

example = unnesting(example, ['dt', 'dist'])
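
On pandas 1.3 or newer, `DataFrame.explode` accepts a list of columns and performs the same unnesting, provided the lists in each row have equal length. A sketch on a toy frame (the values are illustrative) shows how one row per transect becomes one row per observation:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "transect_id": ["a", "b"],
        "dt": [[2.0, 8.0, 14.0], [3.0, 4.0]],
        "dist": [[586.5, 860.9, 579.6], [10.0, 12.0]],
    },
    index=[10, 20],
)

# Multi-column explode needs pandas >= 1.3 and equal-length lists per row.
long = toy.explode(["dt", "dist"])
# explode leaves object dtype behind, so cast the unnested values back to floats.
long = long.astype({"dt": "float64", "dist": "float64"})
print(long)
```

As with `unnesting`, the original index label is repeated for every element, so the metadata columns are duplicated per observation; this is where the row count and memory use grow.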

## Convert to datetime

In the raw data the dates are expressed as decimal years. Here we convert them to datetime types, using 1984-01-01 as the reference date, so a value of 2.0 corresponds to 1986-01-01.

In [0]:
from datetime import timedelta, datetime
from tqdm.auto import tqdm

tqdm.pandas()

def partial2date(number, reference_year=1984):
    year = reference_year + int(number)
    d = timedelta(days=(reference_year + number - year)*365)
    day_one = datetime(year, 1, 1)
    date = d + day_one
    return date

example['ts'] = example['dt'].progress_apply(partial2date)
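
`partial2date` multiplies the fractional part of the year by 365, so in leap years the result can be off by up to a day towards the end of the year. If the decimal part is meant as a fraction of the actual calendar year (an assumption about how the data was encoded, not something confirmed by the source), a leap-year-aware variant could look like the sketch below; `decimal_year_to_datetime` is a hypothetical name.

```python
from datetime import datetime

def decimal_year_to_datetime(number, reference_year=1984):
    """Convert decimal years since reference_year to a datetime, using the true year length."""
    whole_years = int(number)
    year = reference_year + whole_years
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    # (end - start) is 365 or 366 days depending on the calendar year.
    return start + (end - start) * (number - whole_years)

print(decimal_year_to_datetime(2.0))   # start of 1986
print(decimal_year_to_datetime(0.5))   # halfway through the leap year 1984
```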



## Add geometry object

To facilitate quick processing with GeoPandas, it is recommended to add a geometry Point object to the dataframe.


In [0]:
from shapely.geometry import Point


def merge2point(lon, lat):
    return Point(lon, lat)

def add_geometry(df):
  df['geometry'] = list(map(merge2point, df['Intersect_lon'], df['Intersect_lat']))
  return df

example = add_geometry(example)

## Format dataframe

Finally, format the dataframe. We use the transect id as the primary index and the timestamp as the secondary one.

In [0]:
example = example.set_index(['transect_id', 'ts'])
example.head()

transect_id,ts,dt,dist,country_id,continent,country_name,changerate,changerate_unc,flag_sandy,no_shorelines,RMSE,outliers_1,outliers_2,Timespan,intercept,intercept_unc,no_sedcomp,low_detect_shlines,err_changerate,err_timespan,Start_lon,Start_lat,Intersect_lon,Intersect_lat,End_lon,End_lat,coastline_idint
BOX_028_049_247,1986-01-01 12:21:06.457258,2.00141,586.466698,CHL,South America,Chile,0.442936,0.457762,True,20.0,13.258553,"[1, 12]",[],30.0,588.476379,10.65117,1.0,1.0,1.0,1.0,-72.885475,-53.613033,-72.877121,-53.614998,-72.868774,-53.616962,6605.0
BOX_028_049_247,1992-01-01 01:26:20.558111,8.000164,860.857273,CHL,South America,Chile,0.442936,0.457762,True,20.0,13.258553,"[1, 12]",[],30.0,588.476379,10.65117,1.0,1.0,1.0,1.0,-72.885475,-53.613033,-72.877121,-53.614998,-72.868774,-53.616962,6605.0
BOX_028_049_247,1998-01-01 14:30:37.293162,14.001656,579.622902,CHL,South America,Chile,0.442936,0.457762,True,20.0,13.258553,"[1, 12]",[],30.0,588.476379,10.65117,1.0,1.0,1.0,1.0,-72.885475,-53.613033,-72.877121,-53.614998,-72.868774,-53.616962,6605.0
BOX_028_049_247,1999-01-01 08:41:39.205635,15.000992,599.540983,CHL,South America,Chile,0.442936,0.457762,True,20.0,13.258553,"[1, 12]",[],30.0,588.476379,10.65117,1.0,1.0,1.0,1.0,-72.885475,-53.613033,-72.877121,-53.614998,-72.868774,-53.616962,6605.0
BOX_028_049_247,2001-01-01 21:02:45.662584,17.002403,594.319097,CHL,South America,Chile,0.442936,0.457762,True,20.0,13.258553,"[1, 12]",[],30.0,588.476379,10.65117,1.0,1.0,1.0,1.0,-72.885475,-53.613033,-72.877121,-53.614998,-72.868774,-53.616962,6605.0
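
With this two-level index, the series of a single transect or a time window within it can be selected directly. The sketch below uses a toy frame with the same index layout; the ids and values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "transect_id": ["a", "a", "a", "b"],
    "ts": pd.to_datetime(["1986-01-01", "1992-01-01", "1998-01-01", "1990-06-01"]),
    "dist": [586.5, 860.9, 579.6, 10.0],
}).set_index(["transect_id", "ts"]).sort_index()

# All observations of one transect, indexed by timestamp only.
series_a = toy.xs("a", level="transect_id")["dist"]

# Observations of that transect within a date range (requires a sorted index).
window = slice(pd.Timestamp("1990-01-01"), pd.Timestamp("1999-12-31"))
nineties = toy.loc[("a", window), "dist"]

print(series_a)
print(nineties)
```

Sorting the index once after `set_index` keeps slice-based lookups valid and fast on the full dataset.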


## Commit to GitHub

In [0]:
prepare_git_commit()
git_commit(CONFIG_FILE, commit_m='Final cleaning example.')