<a href="https://colab.research.google.com/github/ICRAR/PHYS5511/blob/master/2019/assignments/assignment_two_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


A "naive" baseline solution. The basic idea is very similar to the fully-connected neural network tutorial that we developed back in [week 04](https://github.com/ICRAR/PHYS5511/blob/master/2019/week04/Keras_FC_network_classifier.ipynb).


In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

#Set up

The usual "business", the same as in Assignment One.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

##Get the data for the first time
You only need to run this once; skip this step on subsequent runs.

In [0]:
%cd /content/drive/My\ Drive/PHYS5512/data

In [0]:
!mkdir isfog2020

In [0]:
%cd isfog2020

Make sure you already have the *kaggle.json* file, which you probably downloaded from Kaggle for Assignment One. If not, you can download it again from your Kaggle profile. Then we copy it to the /root/.kaggle directory.

In [0]:
!mkdir -p /root/.kaggle && cp ../../kaggle.json /root/.kaggle/

In [0]:
!kaggle competitions download -c isfog2020-pile-driving-predictions

In [0]:
!ls

##Go to the directory

In [0]:
%cd /content/drive/My Drive/PHYS5512/data/isfog2020

#Pre-process data

The data-checking code in this part is copied from the original kernel.

The dataset is kindly provided by [Cathie Group](http://www.cathiegroup.com).

##Importing data

The first step in any data science exercise is to get familiar with the data. The data is provided in a csv file (```training_data_cleaned.csv```). We can import the data with Pandas and display the first ten rows using ```head(10)```.

In [0]:
train_df = pd.read_csv("training_data_cleaned.csv")  # Store the contents of the csv file in the variable 'train_df'
train_df.head(10)

The data has 12 columns, containing PCPT data ($ q_c $, $ f_s $ and $ u_2 $), recorded hammer data (blowcount, normalised hammer energy, normalised ENTHRU and total number of blows), pile data (diameter, bottom wall thickness and pile final penetration). A unique ID identifies the location and $ z $ defines the depth below the mudline.

The data has already been resampled to a regular grid with 0.5 m intervals to facilitate further data handling. Note that the "grid" axis is the depth.
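
The resampling itself is not shown in this notebook. As a minimal sketch of how irregular depth readings could be put onto a 0.5 m grid, linear interpolation with `np.interp` is one option. The depths and $ q_c $ values below are made up for illustration, and the data providers may well have used a different method.

```python
import numpy as np

# Hypothetical irregular depth readings (m) and cone resistance values
z_raw = np.array([0.0, 0.3, 0.9, 1.4, 2.0])
qc_raw = np.array([0.0, 3.0, 9.0, 14.0, 20.0])

# Regular grid with 0.5 m spacing covering the same depth range
z_grid = np.arange(0.0, 2.0 + 1e-9, 0.5)

# Linear interpolation of the readings onto the regular grid
qc_grid = np.interp(z_grid, z_raw, qc_raw)
print(z_grid)
print(qc_grid)
```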

The hammer energy has been normalised using the same reference energy for all piles in this prediction exercise.

In [0]:
train_df.groupby('Location ID')['Location ID'].agg(['count']).head()

The code above counts the records per location; here we show only the first five locations.

In [0]:
train_locs = train_df['Location ID'].unique()
print(train_locs, len(train_locs))

In [0]:
df_submit = pd.read_csv("sample_submission.csv")
df_submit.head()

The submission contains two columns. The first column is a concatenation of the location ID of the wind turbine and the pile depth. The second column is the predicted blowcount per metre.

In [0]:
print(len(df_submit), df_submit.columns)

In [0]:
test_df = pd.read_csv("validation_data_cleaned.csv")
test_df.head()

The test dataset has 10 columns: the PCPT data ($ q_c $, $ f_s $ and $ u_2 $), **incomplete** hammer data (normalised hammer energy, normalised ENTHRU) and pile data (diameter, bottom wall thickness and pile final penetration).

In [0]:
test_df.groupby('Location ID')['Location ID'].agg(['count']).head()

In [0]:
test_locs = test_df['Location ID'].unique()
print(test_locs, len(test_locs))

The code above lists the unique locations included in the test set.

In [0]:
dist_df = pd.read_csv("interdistance_data.csv")
dist_df.head()


We will **need to** use the distance information in the next iteration. For now we leave it aside, although we have a few good ideas about how to use it. This means that our initial solution cannot use the location information. That is far from optimal, but a simple model is enough to get us started quickly.

##Normalise data

We will use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) in sklearn to normalise the data. The reason is clear: as you can see, the value ranges of the columns are very different. We would like to standardise them so that the weights can be tuned in a uniform way, without being biased by the absolute magnitudes of those values.
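
To make the effect concrete, here is a minimal NumPy sketch of the column-wise standardisation $ z = (x - \mu) / \sigma $, which is what `StandardScaler` computes with its default settings. The toy matrix is made up for illustration.

```python
import numpy as np

# Toy feature matrix: two columns with very different value ranges
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# Column-wise standardisation: subtract the mean, divide by the standard deviation
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X_toy - mean) / std

# After scaling, every column has zero mean and unit standard deviation
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, the two columns are on the same footing even though their raw magnitudes differ by three orders of magnitude.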

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
print(train_df.columns)
print()
print(test_df.columns)

This is just to make sure we know the order of all the column names. We can't afford mistakes here (e.g. selecting columns in the wrong order).
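
One way to reduce the risk of index mistakes is to select columns by name instead of by position. The sketch below uses a toy frame with made-up column names (they are not the real competition headers), so treat it as an illustration of the idea rather than a drop-in replacement for the cells that follow.

```python
import pandas as pd

# Toy frame with made-up column names (not the real competition headers)
toy_df = pd.DataFrame({
    "depth": [0.5, 1.0],
    "qc": [10.0, 12.0],
    "blowcount": [30.0, 35.0],
})

feature_cols = ["depth", "qc"]   # hypothetical feature names
target_col = "blowcount"         # hypothetical target name

# Selecting by name is unaffected by the column order in the file
X_toy = toy_df[feature_cols].values
y_toy = toy_df[[target_col]].values

# The same selection still works after the columns are reordered
shuffled = toy_df[["blowcount", "qc", "depth"]]
assert (shuffled[feature_cols].values == X_toy).all()
print(X_toy.shape, y_toy.shape)
```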

In [0]:
train_np = train_df.values
X = train_np[:, [0, 1, 2, 3, 7, 8, 10, 11, 12]]
print(X.shape, np.mean(X[:, 2]), np.std(X[:, 2]))
y = train_np[:, [6]]
print(y.shape, np.mean(y), np.std(y))


In [0]:
test_np = test_df.values
X_test = test_np[:, [0, 1, 2, 3, 6, 7, 8, 9, 10]]

In [0]:
from sklearn.preprocessing import StandardScaler

Please check the documentation on the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Normalisation is essential for most ML projects.

In [0]:
scaler_x, scaler_y = StandardScaler(), StandardScaler()
scaler_x.fit(X)
scaler_y.fit(y)

In [0]:
X = scaler_x.transform(X)

y = scaler_y.transform(y)

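For the test dataset, should we use a different scaler? No: the test features must be transformed with the scaler that was fitted on the training data, so that the model sees inputs on the same scale as during training.

In [0]:
# Reuse the scaler fitted on the training features; do not refit on the test set
X_test = scaler_x.transform(X_test)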
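
A small NumPy sketch, with made-up numbers, of why the training statistics are reused: the same raw value should map to the same scaled value in both sets, and that only holds when the mean and standard deviation come from the training data.

```python
import numpy as np

# Made-up 1-D feature: training values and shifted test values
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_test = np.array([4.0, 5.0, 6.0])

# Statistics estimated on the training data only
mu, sigma = x_train.mean(), x_train.std()

# Reusing the training statistics: a raw value of 4.0 maps to the same
# scaled value in both sets, which is what a trained model expects
train_scaled = (x_train - mu) / sigma
test_scaled = (x_test - mu) / sigma

# Refitting on the test data: the same raw value 4.0 now maps elsewhere
test_refit = (x_test - x_test.mean()) / x_test.std()

print(train_scaled[3], test_scaled[0], test_refit[0])
```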

Now check again if the data is zero-centred with a unit variance

In [0]:
print(X.shape, np.mean(X[:, 2]), np.std(X[:, 2]))
print(y.shape, np.mean(y), np.std(y))

In [0]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

In [0]:
print(X_train.shape, X_val.shape)

Check whether the validation set still follows a similar distribution.

In [0]:
print(np.mean(X_val[:, 2]), np.std(X_val[:, 2]))
print(np.mean(y_val), np.std(y_val))

#Fully-connected ANN model
First, we use a plain fully-connected network with a linear output (i.e. a regression model) to understand the problem.

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras import backend as K

We will use [callbacks](https://medium.com/singlestone/keras-callbacks-monitor-and-improve-your-deep-learning-205a8a27e91c) to monitor our training progress and to save the "best" model as measured on the validation set. However, do you think the size of the validation set is appropriate?

In [0]:
callback_checkpoint = ModelCheckpoint(filepath='best_model.h5',
                                      monitor='val_loss',
                                      verbose=1,
                                      save_weights_only=True,
                                      save_best_only=True)
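
Conceptually, `save_best_only=True` with `monitor='val_loss'` keeps a running minimum of the validation loss and overwrites the saved weights only when that minimum improves. The plain-Python sketch below mimics this bookkeeping; it is not the Keras implementation, and the loss values are made up.

```python
# Made-up validation losses, one per epoch
val_losses = [0.90, 0.70, 0.75, 0.60, 0.65]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # This is the point at which the checkpoint file would be overwritten
        best_loss, best_epoch = loss, epoch
        print(f"epoch {epoch}: val_loss improved to {loss:.2f}, saving weights")
    else:
        print(f"epoch {epoch}: val_loss {loss:.2f} did not improve")
```

With a small validation set, the "best" epoch picked this way can be noisy, which is one reason the size of the validation split matters.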

In [0]:
def root_mean_squared_error(y_true, y_pred):
  return K.sqrt(K.mean(K.square(y_pred - y_true)))

# Underscore instead of a space: some Keras/TensorFlow versions reject names with spaces
model = Sequential(name='FC_ANN')
model.add(Dense(128, input_shape=(X_train.shape[1],), activation='relu', 
                name='first_hidden'))
#model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', name='second_hidden'))
#model.add(Dropout(0.5))
#model.add(Dense(32, activation='relu', name='third_hidden'))
#model.add(Dropout(0.5))
model.add(Dense(1, name='blowcount', activation='linear'))

model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=[root_mean_squared_error])
model.summary()
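
Because `y` has been standardised, the RMSE reported during training is in units of the target's standard deviation, not in blows/m. A minimal NumPy sketch with made-up numbers shows the metric itself and how multiplying by the standard deviation used for scaling converts it back to the original units.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean squared difference
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up targets and predictions in original units
y_true = np.array([40.0, 60.0, 80.0, 100.0])
y_pred = np.array([45.0, 55.0, 85.0, 95.0])

# Standardise both with the statistics of the true targets
mu, sigma = y_true.mean(), y_true.std()
rmse_scaled = rmse((y_true - mu) / sigma, (y_pred - mu) / sigma)
rmse_original = rmse(y_true, y_pred)

# Scaled RMSE times sigma equals the RMSE in original units
print(rmse_original, rmse_scaled * sigma)
```

In this notebook the corresponding factor would be `scaler_y.scale_[0]`, since `scaler_y` was fitted on the single target column.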

#Train the model

In [0]:
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=32,
                    callbacks=[callback_checkpoint])

# Now test

First, load the weights of the "best" model, i.e. the one with the lowest validation loss during training.

In [0]:
model.load_weights('best_model.h5')

In [0]:
test_pred = model.predict(X_test)

In [0]:
test_pred.shape

In [0]:
test_pred_submit = scaler_y.inverse_transform(test_pred)

In [0]:
test_pred[0:10]

In [0]:
test_pred_submit[0:10]

In [0]:
df_submit['Blowcount [Blows/m]'] = test_pred_submit[:, 0]

In [0]:
df_submit.head()

Please glance over your results before submitting them (don't waste your precious submission quota).

In [0]:
df_submit.to_csv('submit_baseline_ann_01.csv', index=False)

#What's next

1. How to add location information?
   A promising direction is to "recover" the spatial coordinates of the wind turbines from their pair-wise distances. This might be possible with the [MDS method](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html), which learns vector embeddings from a distance matrix.
2. How to treat the PCPT test as a series of inter-dependent data?
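
For the first idea, the sketch below shows how coordinates can be recovered from a pair-wise distance matrix. It uses classical (Torgerson) MDS written directly in NumPy rather than the iterative scikit-learn `MDS` linked above, and the four locations are made up. The real input would have to be built from `interdistance_data.csv`, whose layout is not assumed here.

```python
import numpy as np

# Made-up 2-D coordinates of four locations, used only to build a distance matrix
coords = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Classical MDS: double-centre the squared distances ...
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# ... then keep the two leading eigenvectors, scaled by sqrt(eigenvalue)
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
embedding = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# The embedding reproduces the distances, up to rotation, reflection and translation
D_rec = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
print(np.allclose(D, D_rec))
```

The recovered coordinates are only defined up to rotation, reflection and translation, which is fine if they are used as extra input features for each location.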