# Neural Networks

Next, we build a suitable neural network (NN) and compare its accuracy and computation time with our SVM results.

In [1]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop

## Testing with a simple dataset

### Introduction

We first work on the hourly aggregated data without geodata, just to test the influence of individual prediction parameters and methods.

In [2]:
file_path = "./data/"
taxi = pd.read_csv(f"{file_path}taxi_hourly_processed.csv")
taxi.head(25)

Unnamed: 0,trip_start_timestamp,trip_amount,mean_trip_seconds,mean_trip_miles,mean_trip_total,start_temp,start_precip,start_windspeed,end_temp,end_precip,end_windspeed
0,2021-01-01 00:00:00,82,872.573171,4.837927,19.556829,-1.33,0.0,6.35,-1.319024,0.0,6.519024
1,2021-01-01 01:00:00,51,934.078431,5.023529,17.980392,-1.28,0.0,7.12,-1.285882,0.0,7.190588
2,2021-01-01 02:00:00,53,763.509434,4.466415,16.875283,-1.31,0.0,7.48,-1.281698,0.0,7.446038
3,2021-01-01 03:00:00,38,773.105263,4.000526,17.217368,-1.16,0.0,7.3,-1.136316,0.0,7.303947
4,2021-01-01 04:00:00,29,903.655172,3.885517,19.018621,-0.98,0.0,7.33,-0.906207,0.0,7.615172
5,2021-01-01 05:00:00,30,683.6,4.746,17.094667,-0.77,0.0,8.15,-0.662667,0.0,8.549
6,2021-01-01 06:00:00,32,1054.96875,7.74375,24.189062,-0.31,0.0,9.86,-0.235937,0.0,9.975
7,2021-01-01 07:00:00,65,873.984615,6.4,20.919692,-0.07,0.0,10.21,-0.015077,0.0,10.348615
8,2021-01-01 08:00:00,90,808.666667,5.223667,19.810444,0.14,0.0,10.74,0.137889,0.0,10.788556
9,2021-01-01 09:00:00,96,915.927083,5.660417,20.716562,0.13,0.0,10.97,0.119063,0.0,10.867188


In [3]:
taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   trip_start_timestamp  8760 non-null   object 
 1   trip_amount           8760 non-null   int64  
 2   mean_trip_seconds     8760 non-null   float64
 3   mean_trip_miles       8760 non-null   float64
 4   mean_trip_total       8760 non-null   float64
 5   start_temp            8760 non-null   float64
 6   start_precip          8760 non-null   float64
 7   start_windspeed       8760 non-null   float64
 8   end_temp              8760 non-null   float64
 9   end_precip            8760 non-null   float64
 10  end_windspeed         8760 non-null   float64
dtypes: float64(9), int64(1), object(1)
memory usage: 752.9+ KB


In [4]:
taxi = taxi.astype({'trip_start_timestamp': 'datetime64[ns]'})

### Splitting and normalizing dataset

Before normalizing, we split the dataset into a training set and a test set. We also split a validation set off the training data, so that we can later compare this hold-out validation against cross-validation.

In [5]:
#split dataset into train & test data
train_dataset = taxi.sample(frac=0.7, random_state=0)
test_dataset = taxi.drop(train_dataset.index)
#also split the train data into a validation set for later comparison with cross validation
train_vali_dataset = train_dataset.sample(frac=0.8, random_state=0)
vali_dataset = train_dataset.drop(train_vali_dataset.index)
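
As a quick sanity check on the split above, the following sketch repeats the same `sample`/`drop` pattern on a placeholder frame with 8760 rows (the row count reported by `taxi.info()`). The frame `demo` and its single column are made up for illustration and are not the real data.

```python
import pandas as pd

# Placeholder frame with the same number of rows as the hourly taxi data.
demo = pd.DataFrame({"trip_amount": range(8760)})

# 70 % train / 30 % test, then 80 % / 20 % of the training part.
train = demo.sample(frac=0.7, random_state=0)
test = demo.drop(train.index)
train_vali = train.sample(frac=0.8, random_state=0)
vali = train.drop(train_vali.index)

sizes = {
    "train": len(train),
    "test": len(test),
    "train_vali": len(train_vali),
    "vali": len(vali),
}
print(sizes)

# The parts must not share any rows.
assert train.index.intersection(test.index).empty
assert train_vali.index.intersection(vali.index).empty
```

Overall this corresponds to roughly 56 % of the rows for fitting, 14 % for validation and 30 % for testing.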

First we create `tf.data` datasets from our data frames. The following helper was taken from the TensorFlow tutorial at https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers#create_an_input_pipeline_using_tfdata

In [6]:
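def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  """Turn a DataFrame into a batched tf.data.Dataset of (features, label) pairs."""
  df = dataframe.copy()
  # 'trip_amount' is the target; all remaining columns are features
  labels = df.pop('trip_amount')
  ds = tf.data.Dataset.from_tensor_slices((df, labels))
  if shuffle:
    # a buffer as large as the dataset gives a full shuffle
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  # prepare upcoming batches while the current one is being used
  ds = ds.prefetch(batch_size)
  return ds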

Since our target variable and all features except the time axis are numerical and continuous, we ignore trip_start_timestamp for now. Time-series data also needs further preparation before it is usable, which we will do later.

In [7]:
train_ds = df_to_dataset(train_dataset.drop(["trip_start_timestamp"], axis=1))
test_ds = df_to_dataset(test_dataset.drop(["trip_start_timestamp"], axis=1))

train_vali_ds = df_to_dataset(train_vali_dataset.drop(["trip_start_timestamp"], axis=1))
vali_ds = df_to_dataset(vali_dataset.drop(["trip_start_timestamp"], axis=1))

In [8]:
#normalizing
normalizer = tf.keras.layers.Normalization(axis=-1)
feature_ds = train_ds.map(lambda x, y: x)
normalizer.adapt(feature_ds)
print(normalizer.mean)

tf.Tensor(
[[1.0245482e+03 5.8622885e+00 2.3869076e+01 1.1090101e+01 6.1480746e-02
  7.0873618e+00 1.1094255e+01 6.1240621e-02 7.0909762e+00]], shape=(1, 9), dtype=float32)
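
To make explicit what the adapted layer does, here is a small NumPy sketch of the same idea: the per-feature mean and variance are estimated on the training features only and are then used to standardize any batch. The array `train_features` is made-up data and the helper `standardize` is hypothetical. The small `eps` guard against division by zero is an assumption about how such a layer is typically implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up training features: 1000 rows, 3 columns on very different scales.
train_features = rng.normal(
    loc=[1000.0, 5.0, 0.05], scale=[200.0, 1.5, 0.1], size=(1000, 3)
)

# "adapt" step: the statistics come from the training data only.
mean = train_features.mean(axis=0)
variance = train_features.var(axis=0)


def standardize(batch, eps=1e-7):
    """Shift and scale each column with the training statistics."""
    return (batch - mean) / np.maximum(np.sqrt(variance), eps)


scaled = standardize(train_features)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```

After this step every feature has approximately zero mean and unit standard deviation on the training data, so columns such as mean_trip_seconds no longer dominate purely because of their scale.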


### Simple prediction

Next, we compute our "baseline" prediction.

In [9]:
#predictions without timedata
def get_basic_model():
  model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
  ])
  opt = RMSprop(learning_rate=0.001)
  model.compile(optimizer=opt,
                loss='mse',
                metrics=['mae'])
  return model

In [10]:
model = get_basic_model()
#train_ds is already batched by df_to_dataset, so no batch_size is passed here
model.fit(train_ds, epochs=15)
#next real model with vali



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x20400efb8d0>
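
One way the validation split prepared earlier could be used is to pass it to `fit` through the `validation_data` argument, so that loss and MAE are also reported on held-out data after every epoch. The sketch below does this on synthetic data: the arrays `x` and `y` and the datasets `train_part` and `vali_part` are placeholders, and with the real data `train_vali_ds` and `vali_ds` would presumably take their place. Note that in this setup the normalizer is adapted on the fitting part only.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Synthetic stand-in for nine numeric features and a numeric target.
x = rng.normal(size=(500, 9)).astype("float32")
y = (x @ rng.normal(size=(9, 1)) + 0.1 * rng.normal(size=(500, 1))).astype("float32")

# 80 % for fitting, 20 % held out for validation.
train_part = tf.data.Dataset.from_tensor_slices((x[:400], y[:400])).shuffle(400).batch(32)
vali_part = tf.data.Dataset.from_tensor_slices((x[400:], y[400:])).batch(32)

# Normalization statistics come from the fitting part only.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(x[:400])

model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss="mse",
    metrics=["mae"],
)

# The returned History object keeps one value per epoch for each metric.
history = model.fit(train_part, validation_data=vali_part, epochs=3, verbose=0)
print(sorted(history.history))
```

The `val_`-prefixed entries in `history.history` are the ones to compare between model variants, since they are computed on rows the model was not fitted on.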

In [11]:
#compare valis

In [12]:
#predictions without timedata
#build core model and build then models with more layers
#compare

In [13]:
#prediction with precip as categorical
#compare

In [14]:
#build up timedata as categorical
#compare
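
A possible starting point for treating the time information as categorical is to derive the hour of day and the weekday from the timestamp and one-hot encode them. This is only a sketch on made-up timestamps: the frame `demo` and the derived column names `hour` and `weekday` are assumptions, not part of the processed dataset.

```python
import pandas as pd

# Made-up hourly timestamps standing in for trip_start_timestamp (two days).
stamps = pd.Timestamp("2021-01-01") + pd.to_timedelta(list(range(48)), unit="h")
demo = pd.DataFrame({"trip_start_timestamp": stamps})

# Derive calendar features from the timestamp.
demo["hour"] = demo["trip_start_timestamp"].dt.hour
demo["weekday"] = demo["trip_start_timestamp"].dt.dayofweek  # Monday = 0

# One-hot encode both features; the raw timestamp itself is dropped.
encoded = pd.get_dummies(
    demo.drop(columns="trip_start_timestamp"),
    columns=["hour", "weekday"],
    dtype="float32",
)
print(encoded.shape)
print(list(encoded.columns[:3]))
```

With a full year of data the encoding would yield 24 hour columns and 7 weekday columns, which could be concatenated with the numeric features before building the dataset.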

In [15]:
#build up timedata as timeseries
#compare

In [16]:
#build best model for one aggregated and use it on two vs build best model for both
#compare

In [17]:
#use best model to model each aggregated version
#compare best result

In [18]:
#compare with SVM

In [19]:
#MAYBE add timeseries for 3 years

In [20]:
#To Do

#Build NN for standard hourly and then later train model for the aggregated versions
#show difference using cross vali vs split vali and duration
#show difference between standard vs with geo data, vs 
#show difference trained for one aggregated and used for all vs trained for each

#precip as categorical?

#How to solve time variable = 
#1 use timeseries
#2 use dummy /categoricals

#Maybe add Timeseries for 3 Years to show improvement