<h1>Neural Network - Average Driver Pay</h1>

This notebook is modelled on the neural network notebook by University of Melbourne tutor Lucas Fern:
https://github.com/lucas-fern/MAST30034-wk4-NNs

In [1]:
from tensorflow import keras
from tensorflow.keras.layers import Dense, Normalization

In [3]:
from sklearn.model_selection import train_test_split
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Import only the aggregates used below; importing sum, max and min here would shadow the Python built-ins
from pyspark.sql.functions import avg, count
import numpy as np

In [4]:
spark = (
    SparkSession.builder.appName("Neural Network - Avg Pay")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.parquet.enableVectorizedReader", False)
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

In [5]:
sdf = spark.read.parquet('../data/curated/combined_data')

<h3>Model Fitting and Prediction</h3>

In [6]:
df = (
    sdf.groupby('Date', 'Hour', 'PU_Location_ID')
    .agg(
        avg("Temperature_C").alias("Temperature_C"),
        avg("Humidity_%").alias("Humidity_%"),
        avg("Speed_kmh").alias("Speed_kmh"),
        avg("Precip_Rate_mm").alias("Precip_rate_mm"),
        avg("Driver_pay").alias("Avg_driver_pay"),
        avg("Day_of_week").alias("Day_of_week"),
        count('Temperature_C').alias("Num_trips"),
    )
    .toPandas()
)

In [9]:
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['Hour', 'PU_Location_ID', 'Day_of_week'])
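As a small illustration of what this encoding step does, the sketch below applies `pd.get_dummies` to a made-up three-row frame. The values are hypothetical; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Hour": [0, 1, 0],
    "PU_Location_ID": [4, 7, 7],
    "Temperature_C": [21.5, 19.0, 22.1],
})

# Each distinct value of an encoded column becomes its own indicator column.
# Columns that are not encoded (Temperature_C here) are kept unchanged.
encoded = pd.get_dummies(toy, columns=["Hour", "PU_Location_ID"])
print(list(encoded.columns))
```

A value that never occurs in the data gets no indicator column, which is one reason to encode before splitting into train and test sets, as this notebook does.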

In [14]:
df.shape

(2177404, 299)

In [15]:
# shuffle=False keeps the existing row order, so the test set only contains later
# dates if the rows are sorted by Date first (a Spark groupby does not guarantee order).
# This assumes the Date column sorts chronologically.
TARGET_COLS = ['Avg_driver_pay']
NON_FEATURE_COLS = TARGET_COLS + ['Num_trips', 'Date']

df = df.sort_values('Date')
train, test = train_test_split(df, train_size=0.8, shuffle=False)

X_train, y_train = train.drop(columns=NON_FEATURE_COLS), train[TARGET_COLS]
X_test, y_test = test.drop(columns=NON_FEATURE_COLS), test[TARGET_COLS]
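`shuffle=False` keeps rows in their existing order, so the split is chronological only if the frame is ordered by date. The sketch below shows one way to check this on made-up rows. It uses plain positional slicing, which is what an unshuffled split amounts to (the leading rows become the training set); the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical rows, deliberately out of date order.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-01", "2022-01-05",
                            "2022-01-02", "2022-01-04"]),
    "Avg_driver_pay": [15.0, 14.2, 16.1, 13.9, 15.5],
})

def split_in_order(frame, train_fraction=0.8):
    """Take the leading rows as train and the rest as test, without shuffling."""
    n_train = int(len(frame) * train_fraction)
    return frame.iloc[:n_train], frame.iloc[n_train:]

# Without sorting, the training set can contain dates later than the test set.
train_raw, test_raw = split_in_order(toy)
unsorted_is_chronological = train_raw["Date"].max() <= test_raw["Date"].min()

# After sorting by Date, every training date precedes every test date.
train_sorted, test_sorted = split_in_order(toy.sort_values("Date"))
sorted_is_chronological = train_sorted["Date"].max() <= test_sorted["Date"].min()

print(unsorted_is_chronological, sorted_is_chronological)
```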

In [18]:
X_train.shape

(1741923, 296)

In [17]:
X_test.shape

(435481, 296)

In [11]:
# Standardise the inputs so that high-magnitude features do not dominate training
norm_layer = Normalization()
norm_layer.adapt(X_train)
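Conceptually, an adapted normalisation layer stores the per-feature mean and variance of the training data and standardises every input with those statistics. The NumPy sketch below reproduces that arithmetic on made-up numbers. It is an illustration of the idea rather than the Keras implementation, and all names in it are hypothetical.

```python
import numpy as np

# Hypothetical feature matrices: column 0 is small-scale, column 1 is large-scale.
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_toy = np.array([[2.0, 300.0]])

# The statistics come from the training rows only, so no information
# from the test set leaks into the scaling.
feature_mean = X_train_toy.mean(axis=0)
feature_std = X_train_toy.std(axis=0)

# The same training statistics are applied to both sets.
X_train_scaled = (X_train_toy - feature_mean) / feature_std
X_test_scaled = (X_test_toy - feature_mean) / feature_std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, both columns have zero mean and unit standard deviation on the training rows, so neither dominates purely because of its units.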


In [12]:
model = keras.Sequential(
    [   
        norm_layer,
        Dense(10, activation='relu'),
        Dense(1, activation='relu')
    ]
)

In [13]:
model.compile(
    optimizer='adam',
    loss='MSE'
)

Five epochs were chosen experimentally: this is roughly where the validation loss levels out.

In [14]:
history = model.fit(
    x=X_train,
    y=y_train,
    batch_size=16,
    validation_split=0.25,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
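One way to make the "validation loss levels out" judgement explicit is to look for the first epoch after which the relative improvement drops below a threshold. The helper below is a hypothetical sketch: the function name, the 1% threshold, and the example losses are all assumptions, and the numbers are not taken from this notebook's run.

```python
def first_plateau_epoch(val_losses, min_improvement=0.01):
    """Return the 1-based epoch after which the validation loss improves by
    less than `min_improvement` (as a fraction of the previous loss).
    If the loss never plateaus, return the last epoch.
    Assumes all losses are positive."""
    for i in range(1, len(val_losses)):
        previous, current = val_losses[i - 1], val_losses[i]
        if (previous - current) / previous < min_improvement:
            return i
    return len(val_losses)

# Hypothetical validation losses, NOT the values from this notebook.
example_val_losses = [14.0, 12.0, 11.5, 11.45, 11.44]
plateau_epoch = first_plateau_epoch(example_val_losses)
print(plateau_epoch)
```

With a fitted model, the per-epoch validation losses are available as `history.history['val_loss']` whenever `validation_split` or validation data is passed to `fit`.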


<h3>Basic Model Performance Analysis</h3>

In [15]:
comparison = y_test.iloc[:10].copy()
comparison.loc[:, 'prediction_avg_pay'] = model.predict(X_test.head(10))
comparison



         Avg_driver_pay  prediction_avg_pay
1421589       22.206667           20.623829
1679459       15.556667           21.042217
600889        16.983902           19.256184
600890        15.181429           16.601954
592706        15.750909           16.055719
1420705       13.456703            15.08415
80970          14.17271           15.556137
79654         15.865714           16.260143
337620        16.064625           16.438223
880092          16.2524           17.761364


In [16]:
predictions = model.predict(X_test)
errors = np.array(predictions - y_test)
squared_errors = errors**2
mean_squared_error = squared_errors.mean()

print(f'MSE: {mean_squared_error}')

MSE: 11.046708291874292


In [17]:
tot_sum_squares = (np.array(y_test - y_test.mean())**2).sum()
r2 = 1 - (squared_errors.sum() / tot_sum_squares)
print(f'Model R^2: {r2:.4f}')

Model R^2: 0.4489
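The two metrics above follow MSE = mean((prediction - actual)^2) and R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared deviation of the actual values from their mean. The sketch below wraps both in one hypothetical helper and checks it on made-up values. Flattening the inputs first is a design choice: subtracting an `(n, 1)` array from an `(n,)` array in NumPy would broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for two array-likes holding the same number of values."""
    # Flatten so that (n, 1) and (n,) inputs are compared element by element.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    squared_errors = (y_pred - y_true) ** 2
    mse = squared_errors.mean()

    total_sum_squares = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - squared_errors.sum() / total_sum_squares
    return mse, r2

# Hypothetical values: always predicting the mean of y_true gives R^2 = 0.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
baseline_mse, baseline_r2 = mse_and_r2(y_true_toy, [2.5] * 4)
print(baseline_mse, baseline_r2)
```

Read this way, an R^2 of 0 means the model does no better than predicting the mean of the test targets, and 1 means a perfect fit.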


<h3>Save Predictions for Further Analysis</h3>

In [20]:
pd.DataFrame(predictions).to_csv('../data/curated/model_data/avg_driver_pay_pred_nn.csv')