# Neural Networks

We will now build a suitable neural network and compare its accuracy and computation time with our SVM results.

# Because my model simply produces very questionable results, it is being scrapped

In [1]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.optimizers import Adam
from tensorflow_addons.metrics import RSquare
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
from tensorflow.keras import layers


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



## Testing with a simple dataset

### Introduction

We will first work on our hourly aggregated data without geodata, just to test the influence of certain prediction parameters and methods.

In [2]:
file_path = "./data/"
taxi = pd.read_csv(f"{file_path}taxi_hourly_processed.csv")
taxi.head()

| | trip_start_timestamp | trip_amount | mean_trip_seconds | mean_trip_miles | mean_trip_total | start_temp | start_precip | start_windspeed | end_temp | end_precip | end_windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 00:00:00 | 82 | 872.573171 | 4.837927 | 19.556829 | -1.33 | 0.0 | 6.35 | -1.319024 | 0.0 | 6.519024 |
| 1 | 2021-01-01 01:00:00 | 51 | 934.078431 | 5.023529 | 17.980392 | -1.28 | 0.0 | 7.12 | -1.285882 | 0.0 | 7.190588 |
| 2 | 2021-01-01 02:00:00 | 53 | 763.509434 | 4.466415 | 16.875283 | -1.31 | 0.0 | 7.48 | -1.281698 | 0.0 | 7.446038 |
| 3 | 2021-01-01 03:00:00 | 38 | 773.105263 | 4.000526 | 17.217368 | -1.16 | 0.0 | 7.3 | -1.136316 | 0.0 | 7.303947 |
| 4 | 2021-01-01 04:00:00 | 29 | 903.655172 | 3.885517 | 19.018621 | -0.98 | 0.0 | 7.33 | -0.906207 | 0.0 | 7.615172 |


Since end_temp, end_precip and end_windspeed are bound to the end of the trip, we will drop them, as well as mean_trip_seconds, mean_trip_miles and mean_trip_total, since we would not have this data when predicting future demand.

In [3]:
taxi = taxi.drop(columns = ['mean_trip_seconds','mean_trip_miles','mean_trip_total','end_temp','end_precip','end_windspeed'])

In [4]:
taxi.describe()

| | trip_amount | start_temp | start_precip | start_windspeed |
|---|---|---|---|---|
| count | 8760.0 | 8760.0 | 8760.0 | 8760.0 |
| mean | 423.985502 | 11.17475 | 0.056735 | 7.09228 |
| std | 313.461767 | 10.013487 | 0.231349 | 3.416467 |
| min | 0.0 | -16.45 | 0.0 | 0.22 |
| 25% | 146.0 | 2.6275 | 0.0 | 4.46 |
| 50% | 378.0 | 10.905 | 0.0 | 6.93 |
| 75% | 639.0 | 20.3125 | 0.0 | 9.3 |
| max | 1487.0 | 29.98 | 1.0 | 21.83 |


In [5]:
taxi = taxi.astype({'trip_start_timestamp': 'datetime64[ns]'})

### Splitting and normalizing dataset

Before we can start normalizing, we will first split our dataset. We will create a training and a test set, and additionally split a validation set off the training data, so that we can later compare split validation against cross validation.

In [6]:
#split dataset into train & test data
train_dataset = taxi.sample(frac=0.7, random_state=0)
test_dataset = taxi.drop(train_dataset.index)
#also split the train data into train and validation sets for the later comparison with cross validation
train_vali_dataset = train_dataset.sample(frac=0.8, random_state=0)
vali_dataset = train_dataset.drop(train_vali_dataset.index)

We will first convert our DataFrames into batched TensorFlow datasets.

In [7]:
def df_to_dataset(dataframe, batch_size=32):
  df = dataframe.copy()
  labels = df.pop('trip_amount')
  ds = tf.data.Dataset.from_tensor_slices((df, labels))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds

Since our target variable and all our features except the time axis are numerical and continuous, we will ignore trip_start_timestamp at first. Time series data also requires further work to be usable, which we will do later.

In [8]:
train_ds = df_to_dataset(train_dataset.drop(["trip_start_timestamp"], axis=1))
#test_ds = df_to_dataset(test_dataset.drop(["trip_start_timestamp"], axis=1))

train_vali_ds = df_to_dataset(train_vali_dataset.drop(["trip_start_timestamp"], axis=1))
vali_ds = df_to_dataset(vali_dataset.drop(["trip_start_timestamp"], axis=1))

In [9]:
#normalizing
normalizer = tf.keras.layers.Normalization(axis=-1)
feature_ds = train_ds.map(lambda x, y: x)
normalizer.adapt(feature_ds)
print(normalizer.variance)

tf.Tensor([[1.01518524e+02 5.77008724e-02 1.17243452e+01]], shape=(1, 3), dtype=float32)


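As the adapted variances are roughly the squares of the standard deviations we calculated earlier with describe() (for example 10.01² ≈ 100.3 versus 101.5 for start_temp), we can assume that the normalization layer has been adapted correctly. Small deviations are expected, since the layer only sees the 70% training sample.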
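The comparison can be made explicit with a few lines of NumPy. This is only a sketch: the numbers are copied by hand from the describe() table and from the printed normalizer.variance above, and the 10% tolerance is an arbitrary choice for illustration.

```python
import numpy as np

# Standard deviations as printed by taxi.describe()
# (order: start_temp, start_precip, start_windspeed).
describe_std = np.array([10.013487, 0.231349, 3.416467])

# Variances stored by the adapted Normalization layer (copied from the output).
adapted_var = np.array([1.01518524e+02, 5.77008724e-02, 1.17243452e+01])

# Squaring the standard deviations puts both on the same scale.
expected_var = describe_std ** 2
relative_gap = np.abs(adapted_var - expected_var) / expected_var

feature_names = ["start_temp", "start_precip", "start_windspeed"]
for name, exp, got, gap in zip(feature_names, expected_var, adapted_var, relative_gap):
    print(f"{name}: std^2={exp:.4f} adapted={got:.4f} gap={gap:.1%}")
```

The gaps are not zero because describe() was run on the full year, while the layer was adapted on the sampled training rows only.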

### Simple prediction

Next we will compute our "baseline" prediction, using two dense layers with 64 nodes each and ReLU activation functions, followed by a single output node. It is very basic and therefore a good start. We also fix the learning rate, optimizer and loss function for now and will only try to improve them at the end.

In [10]:
def get_basic_model(lr=0.01):
  mod = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
  ])
  opt = Adam(learning_rate=lr)
  mod.compile(optimizer=opt,
                loss='mse',
                metrics=['mae', RSquare()])
  return mod

In [11]:
#prediction with validation split
def split_valid(train_set, valid_set, Built_model, num_epochs = 30, verb_f = 0, verb_e = 0):
    #higher epochs lead to better training but worse validation/testing results
    Built_model.fit(train_set, epochs=num_epochs, verbose=verb_f)

    val_mse, val_mae, val_r2 = Built_model.evaluate(valid_set, verbose=verb_e)
    print("MSE", val_mse)
    print("MAE", val_mae)
    print("R2", val_r2)

In [12]:
#prediction with cross validation
def cross_valid(ds, Built_model, shards = 5, num_epochs = 25, verb_f = 0, verb_e = 0 ):
    #higher epochs lead to better training but worse validation/testing results
    all_mse_scores = [] 
    all_mae_scores = [] 
    all_r2_scores = [] 

    for i in range(shards):
        cross_vali_ds = ds.shard(num_shards=shards, index=i)
        init = True

        for j in range(shards):
            if i == j:
                continue
            if init:
                cross_train_ds = ds.shard(num_shards=shards, index=j)
                init = False
                continue
            cross_train_ds = cross_train_ds.concatenate(ds.shard(num_shards=shards, index=j))

        
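        # Note: the same model instance is reused for every fold, so from the second
        # fold on it is validated on batches it has already been trained on.
        Built_model.fit(cross_train_ds, epochs=num_epochs, verbose=verb_f)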

        val_mse, val_mae, val_r2 = Built_model.evaluate(cross_vali_ds, verbose=verb_e)
        
        # Append this fold's MSE, MAE and R2 to the score lists
        all_mse_scores.append(val_mse)
        all_mae_scores.append(val_mae)
        all_r2_scores.append(val_r2)

    print("Mean MSE", np.mean(all_mse_scores))
    print("Mean MAE", np.mean(all_mae_scores))
    print("Mean R2", np.mean(all_r2_scores))
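To make the fold construction easier to follow, here is a plain-Python sketch of what shard does. The helper shard() below is a hand-written stand-in (not the TensorFlow method itself) that keeps every num_shards-th element, and the integers stand in for batches. The comment inside the loop marks where a fresh model could be built per fold; this is a suggestion, not what cross_valid currently does.

```python
def shard(items, num_shards, index):
    """Stand-in for Dataset.shard: keep every num_shards-th item, starting at index."""
    return items[index::num_shards]

batches = list(range(10))  # stand-ins for ten batches of a batched dataset
k = 5

folds = []
for i in range(k):
    validation = shard(batches, k, i)
    # Concatenate the remaining shards, in the same order as cross_valid does.
    training = [b for j in range(k) if j != i for b in shard(batches, k, j)]
    folds.append((training, validation))
    # A fresh model could be built here (e.g. via get_basic_model()), so that
    # no fold is validated on batches the model has already been trained on.

for i, (training, validation) in enumerate(folds):
    print(f"fold {i}: train={training} validate={validation}")
```

Because the shards interleave neighbouring batches, validation batches always sit right next to training batches in the original order.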

In [13]:
model = get_basic_model()
split_valid(train_vali_ds, vali_ds, model, verb_e = 1)

MSE 92081.8515625
MAE 240.69772338867188
R2 0.10678142309188843


Since we are working with very little information (only three weather features), our R2 score is very low.

In [14]:
model2 = get_basic_model()
cross_valid(train_ds, model2, shards = 5, verb_e = 1)

Mean MSE 81157.640625
Mean MAE 228.0220703125
Mean R2 0.16630287170410157


Cross validation yields somewhat better scores (R2 of 0.17 versus 0.11), while split validation is much faster.

### Testing different model compositions

We will now test three other layer compositions to try to increase our R2 score.

In [15]:
model3 = get_basic_model()
model3.pop()
model3.pop()
model3.add(layers.Dense(128, activation="relu"))
model3.add(layers.Dense(256, activation="relu"))
model3.add(layers.Dense(512, activation="relu"))
model3.add(layers.Dense(256, activation="relu"))
model3.add(layers.Dense(128, activation="relu"))
model3.add(layers.Dense(64, activation="relu"))
model3.add(layers.Dense(1))
#split_valid(train_vali_ds, vali_ds, model3)
cross_valid(train_ds, model3, shards = 5)

Mean MSE 78991.70625
Mean MAE 224.50763854980468
Mean R2 0.1885099768638611


In [16]:
model4 = get_basic_model()
model4.pop()
model4.pop()
model4.add(layers.Dense(18, activation="relu"))
model4.add(layers.Dense(9, activation="relu"))
model4.add(layers.Dense(18, activation="relu"))
model4.add(layers.Dense(1))
#split_valid(train_vali_ds, vali_ds, model4)
cross_valid(train_ds, model4, shards = 5)

Mean MSE 80199.2890625
Mean MAE 225.8953857421875
Mean R2 0.17613511085510253


In [17]:
model5 = get_basic_model()
model5.pop()
model5.pop()
model5.add(layers.Dense(1024, activation="relu"))
model5.add(layers.Dense(512, activation="relu"))
model5.add(layers.Dense(256, activation="relu"))
model5.add(layers.Dense(1))
#split_valid(train_vali_ds, vali_ds, model5)
cross_valid(train_ds, model5, shards = 5)

Mean MSE 79054.7203125
Mean MAE 224.33911743164063
Mean R2 0.18783769607543946


Even though wider and deeper layers greatly increase the computation time, the results improve only marginally: the mean R2 rises from 0.166 for the baseline to about 0.188 for both the deep model3 and the wide model5, while the small model4 reaches 0.176.
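A rough way to see why the larger variants are so much slower is to count their trainable parameters. The sketch below assumes three input features and that the two pop() calls leave the normalizer and the first Dense(64) layer in place. The Normalization layer itself has no trainable weights, so it is left out.

```python
def dense_params(layer_sizes):
    """Trainable parameters of a stack of Dense layers: weights plus biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 3 input features, followed by the Dense widths of each model.
architectures = {
    "baseline": [3, 64, 64, 1],
    "model3": [3, 64, 128, 256, 512, 256, 128, 64, 1],
    "model4": [3, 64, 18, 9, 18, 1],
    "model5": [3, 64, 1024, 512, 256, 1],
}

param_counts = {name: dense_params(sizes) for name, sizes in architectures.items()}
for name, count in param_counts.items():
    print(f"{name}: {count:,} trainable parameters")
```

Under these assumptions model5 has roughly 160 times as many parameters as the baseline (723,201 versus 4,481), for an R2 gain of about 0.02.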

### Implementing categoricals

We will now try to improve our results by treating precipitation as a categorical feature and by extracting categorical features from the time data.

For that we need several helper functions. The first one creates a dictionary dataset, since our features will no longer share one dtype and a single tensor cannot hold mixed dtypes. The second one takes a dict and stacks its values so that a normalization layer can process them.
The third one is our preprocessing step, in which each feature is preprocessed according to its type (given by lists of feature names). The last one takes the preprocessing output and builds a model with it.

#### Helper functions

We will use a special cross-validation technique for time series, TimeSeriesSplit from sklearn. It always trains on earlier observations and validates on later ones, so no information from the future leaks into training.
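The splitting scheme can be illustrated without any library. The generator below is a hand-written imitation that is assumed to mirror the default behaviour of TimeSeriesSplit (no gap, automatic test size). The name expanding_window_splits and the toy size of 12 samples are made up for this sketch.

```python
def expanding_window_splits(n_samples, n_splits):
    """Hand-written imitation of an expanding-window time series split."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        # Train on everything before `start`, validate on the next block.
        yield list(range(start)), list(range(start, start + test_size))

splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for fold, (train_idx, vali_idx) in enumerate(splits, start=1):
    # The helper functions in this notebook only evaluate on the last split.
    role = "evaluate only" if fold == len(splits) else "fit + validate"
    print(f"fold {fold} ({role}): train={train_idx} validate={vali_idx}")
```

Each training window ends exactly where its validation block begins, and the windows grow from fold to fold.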

In [18]:
#The first one creates a dictionary dataset, since our features will no longer share one dtype and a single tensor can't hold mixed dtypes
def df_to_dictonary_dataset(dataframe, batch_size=32, target = 'trip_demand'):
  df = dataframe.copy()
  labels = df.pop(target)
  
  dict_ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
  dict_ds = dict_ds.batch(batch_size)
  dict_ds = dict_ds.prefetch(batch_size)
  return dict_ds

In [19]:
#The second one takes a dict and stacks it to make it processable by a normalization layer.
def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
      values.append(tf.cast(inputs[key], tf.float32))

    return fun(values, axis=-1)
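For readers who want to see the effect of the stacking step in isolation, here is a NumPy analogue. The function name stack_columns is made up for this sketch; the values are the first three rows of taxi.head().

```python
import numpy as np

def stack_columns(columns):
    """NumPy analogue of stack_dict: cast to float32, stack along the last axis."""
    ordered = [np.asarray(columns[key], dtype=np.float32) for key in sorted(columns)]
    return np.stack(ordered, axis=-1)

batch = {
    "start_windspeed": [6.35, 7.12, 7.48],
    "start_temp": [-1.33, -1.28, -1.31],
}

stacked = stack_columns(batch)
print(stacked.shape)  # (3, 2): three rows, two features
print(stacked[0])     # first row: start_temp, then start_windspeed
```

Sorting the keys fixes the column order, so the normalizer sees the features in the same order when it is adapted and when it is applied.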

In [20]:
#The third one is our preprocessing step, in which we preprocess each feature according to its type (given by lists of feature names).
def build_preprocessed(processing_df, bin_feat, cate_feat, num_feat):
  inputs = {} #convert data into keras input
  for name, column in processing_df.items():
    if (name in cate_feat or name in bin_feat):
      dtype = tf.int64
    else:
      dtype = tf.float32

    inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)

  #preprocess binary data -> not much preprocessing being done
  preprocessed = []
  for name in bin_feat:
      inp = inputs[name]
      inp = inp[:, tf.newaxis]
      float_value = tf.cast(inp, tf.float32)
      preprocessed.append(float_value)

  #preprocess numeric data -> normalization
  numeric_features = pd.DataFrame()
  for name in num_feat:
    numeric_features[name]=processing_df[name]

  tf.convert_to_tensor(numeric_features)

  normalizer = tf.keras.layers.Normalization(axis=-1)
  normalizer.adapt(stack_dict(dict(numeric_features)))

  numeric_inputs = {}
  for name in num_feat:
    numeric_inputs[name]=inputs[name]

  numeric_inputs = stack_dict(numeric_inputs)
  numeric_normalized = normalizer(numeric_inputs)

  preprocessed.append(numeric_normalized)

  #preprocess categorical data -> create dummy binaries for each category of a feature
  for name in cate_feat:
    vocab = sorted(set(processing_df[name]))

    lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

    x = inputs[name][:, tf.newaxis]
    x = lookup(x)
    preprocessed.append(x)

  return inputs, preprocessed
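The one-hot step for categorical features can be pictured with a small NumPy imitation. It assumes the layer's default of one out-of-vocabulary (OOV) slot at index 0, which makes the encoded width one larger than the vocabulary. The function one_hot_with_oov and the example values are hypothetical.

```python
import numpy as np

def one_hot_with_oov(values, vocab):
    """Imitate a one-hot integer lookup: slot 0 is reserved for unseen values."""
    index = {token: position + 1 for position, token in enumerate(vocab)}
    encoded = np.zeros((len(values), len(vocab) + 1), dtype=np.float32)
    for row, value in enumerate(values):
        encoded[row, index.get(value, 0)] = 1.0
    return encoded

weekdays = list(range(7))  # a vocabulary as produced by sorted(set(...))
encoded = one_hot_with_oov([0, 4, 6, 9], weekdays)

print(encoded.shape)           # (4, 8): 7 weekdays plus one OOV slot
print(encoded.argmax(axis=1))  # [1 5 7 0]: the unseen value 9 lands in slot 0
```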

In [21]:
#The last one takes the preprocessing output and builds a model with it.
def do_preprocessing_and_model_building(processing_df, bin_feat, cate_feat, num_feat, lr = 0.1):
  inputs, preprocessed = build_preprocessed(processing_df, bin_feat, cate_feat, num_feat)
  preprocessed_result = tf.concat(preprocessed, axis=-1)
  preprocessor = tf.keras.Model(inputs, preprocessed_result)

  #Note: although model5 scored well, the body below uses the baseline 64-64-1 layers
  
  body = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  x = preprocessor(inputs)
  result = body(x)
  model = tf.keras.Model(inputs, result)
  opt = Adam()  # default learning rate; the lr argument is currently not applied
  model.compile(optimizer=opt,
                loss='mse',
                metrics=['mae', RSquare()])
  return model

In [22]:
def split_data_and_cross_valid_for_timedata(df, binary_feature_names, categorical_feature_names, numeric_feature_names, splits = 5, num_epochs = 25, verb_f = 0, verb_e = 0, test = False, target = 'trip_demand', ler = 0.1):
    work_df = df.copy()

    work_df_preprocessing= work_df.loc[:, work_df.columns != target]
    mod = do_preprocessing_and_model_building(work_df_preprocessing, binary_feature_names, categorical_feature_names, numeric_feature_names, lr = ler)

        
    all_mse_scores = [] 
    all_mae_scores = [] 
    all_r2_scores = [] 
    tscv = TimeSeriesSplit(splits)
    counter = 0
    #epoch_inc = ((num_epochs)/splits)
    
    for train_index, test_index in tscv.split(work_df):
        counter = counter+1
        #epochs= int(epoch_inc*counter) #epochs fitted to dataset size
        
        cross_train_df, cross_vali_df = work_df.iloc[train_index, :], work_df.iloc[test_index,:]
        
        cross_train_ds = df_to_dictonary_dataset(cross_train_df, target = target)
        cross_vali_ds = df_to_dictonary_dataset(cross_vali_df, target = target)

        if test and counter == splits:
            val_mse, val_mae, val_r2 = mod.evaluate(cross_vali_ds, verbose=verb_e)
            continue
        elif test == False and counter == splits:
            continue

        mod.fit(cross_train_ds, epochs=num_epochs, verbose=verb_f)

        val_mse, val_mae, val_r2 = mod.evaluate(cross_vali_ds, verbose=verb_e)
        
        # Append this fold's MSE, MAE and R2 to the score lists
        all_mse_scores.append(val_mse)
        all_mae_scores.append(val_mae)
        all_r2_scores.append(val_r2)
    
    
    print("Mean MSE", np.mean(all_mse_scores))
    print("Mean MAE", np.mean(all_mae_scores))
    print("Mean R2", np.mean(all_r2_scores))
    if test:
        print("")
        print("Test MSE", val_mse)
        print("Test MAE", val_mae)
        print("Test R2", val_r2)

In [23]:
def cross_valid_for_timedata(df, bin_feat, num_feat, splits = 5, num_epochs = 25, verb_f = 0, verb_e = 0, test = False, target = 'trip_demand', lr = 0.01):
    split_df = df.copy()
    labels = split_df.pop(target)
    inputs = {} #convert data into keras input
    for name, column in split_df.items():
        if (name in bin_feat):
            dtype = tf.int64
        else:
            dtype = tf.float32

        inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)

    preprocessed = []
    for name in bin_feat:
        inp = inputs[name]
        inp = inp[:, tf.newaxis]
        float_value = tf.cast(inp, tf.float32)
        preprocessed.append(float_value)

    numeric_features = pd.DataFrame()
    for name in num_feat:
        numeric_features[name]=split_df[name]

    tf.convert_to_tensor(numeric_features)

    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(stack_dict(dict(numeric_features)))

    numeric_inputs = {}
    for name in num_feat:
        numeric_inputs[name]=inputs[name]

    numeric_inputs = stack_dict(numeric_inputs)
    numeric_normalized = normalizer(numeric_inputs)

    preprocessed.append(numeric_normalized)

    preprocessor = tf.keras.Model(inputs, (tf.concat(preprocessed, axis=-1)))

    body = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

    result = body(preprocessor(inputs))
    model = tf.keras.Model(inputs, result)
    model.compile(optimizer=Adam(),  # default learning rate; lr is currently not applied
                    loss='mse',
                    metrics=['mae', RSquare()])

    
    all_mse_scores = [] 
    all_mae_scores = [] 
    all_r2_scores = [] 
    tscv = TimeSeriesSplit(splits)
    counter = 0
    #epoch_inc = ((num_epochs)/splits)
    work_df = df.copy()
    for train_index, test_index in tscv.split(work_df):
        counter = counter+1
        #epochs= int(epoch_inc*counter) #epochs fitted to dataset size
        
        cross_train_df, cross_vali_df = work_df.iloc[train_index, :], work_df.iloc[test_index,:]
        
        cross_train_ds = df_to_dictonary_dataset(cross_train_df, target = target)
        cross_vali_ds = df_to_dictonary_dataset(cross_vali_df, target = target)

        if test and counter == splits:
            val_mse, val_mae, val_r2 = model.evaluate(cross_vali_ds, verbose=verb_e)
            continue
        elif test == False and counter == splits:
            continue

        model.fit(cross_train_ds, epochs=num_epochs, verbose=verb_f)

        val_mse, val_mae, val_r2 = model.evaluate(cross_vali_ds, verbose=verb_e)
        
        # Append this fold's MSE, MAE and R2 to the score lists
        all_mse_scores.append(val_mse)
        all_mae_scores.append(val_mae)
        all_r2_scores.append(val_r2)
    
    
    print("Mean MSE", np.mean(all_mse_scores))
    print("Mean MAE", np.mean(all_mae_scores))
    print("Mean R2", np.mean(all_r2_scores))
    if test:
        print("")
        print("Test MSE", val_mse)
        print("Test MAE", val_mae)
        print("Test R2", val_r2)

#### Application for precip as categorical

We will now apply our preprocessing layer to our dataset with precipitation as a categorical feature

In [24]:
taxi_prec = taxi.copy()
taxi_prec = taxi_prec.drop(["trip_start_timestamp"], axis=1)
taxi_prec["start_precip"] = taxi_prec["start_precip"].apply(lambda x: 0 if (x <= 0) else 1 )

train_prec = taxi_prec.sample(frac=0.7, random_state=0)
test_prec = taxi_prec.drop(train_prec.index)

train_prec_ds = df_to_dictonary_dataset(train_prec, target = 'trip_amount')
#test_prec_ds = df_to_dictonary_dataset(test_prec, target = 'trip_amount')

In [25]:
binary_feature_names = ['start_precip']
categorical_feature_names = []
numeric_feature_names = ['start_temp', 'start_windspeed']

taxi_prec_preprocessing= taxi_prec.loc[:, taxi_prec.columns != 'trip_amount']
#The whole dataset is passed in to build the preprocessing layers; no weights are fitted yet.
#Note that the normalization statistics therefore also include the test rows.
model6 = do_preprocessing_and_model_building(taxi_prec_preprocessing, binary_feature_names, categorical_feature_names, numeric_feature_names)

cross_valid(train_prec_ds, model6, shards = 5)

Mean MSE 81852.04375
Mean MAE 231.20285034179688
Mean R2 0.15918352603912353


The results are roughly the same as when using precipitation as a continuous numerical variable.

#### Application for timedata as categorical

Next we will apply our helper functions to categorical features derived from the time data.

In [26]:
taxi_date = taxi.copy()
taxi_date["weekday"] = taxi_date["trip_start_timestamp"].apply(lambda x:x.dayofweek)
taxi_date["month"] = taxi_date["trip_start_timestamp"].apply(lambda x:x.month)
taxi_date["is_weekend"] = taxi_date["weekday"].apply(lambda x: 1 if (x in [5,6]) else 0)
taxi_date["hour"] = taxi_date["trip_start_timestamp"].apply(lambda x:x.hour)
taxi_date.set_index('trip_start_timestamp', inplace=True)
taxi_date.sort_index(inplace=True)

binary_feature_names = ['is_weekend']
categorical_feature_names = ['weekday', 'month', 'hour']
numeric_feature_names = ['start_temp', 'start_precip', 'start_windspeed']

split_data_and_cross_valid_for_timedata(taxi_date, binary_feature_names, categorical_feature_names, numeric_feature_names, splits = 5, verb_e = 1, num_epochs = 60, test = True, target = 'trip_amount')
#even though our model is extremely overfitted, the test results are even worse with fewer epochs

Mean MSE 35576.835205078125
Mean MAE 139.24091911315918
Mean R2 0.567512720823288

Test MSE 103767.515625
Test MAE 262.2021179199219
Test R2 0.23279625177383423


In [27]:
#we can't use sample() here, since a random split would break the chronological order of our time data
n = len(taxi_date)
train_date = taxi_date[0:int(n*0.7)]
test_date = taxi_date[int(n*0.7):]

train_date_ds = df_to_dictonary_dataset(train_date, target = 'trip_amount')
#test_date_ds = df_to_dictonary_dataset(test_date, target = 'trip_amount')

In [28]:
taxi_date_preprocessing= taxi_date.loc[:, taxi_date.columns != 'trip_amount']
#The whole dataset is passed in to build the preprocessing layers; no weights are fitted yet.
#Note that the normalization statistics and vocabularies therefore also include the test rows.
model7 = do_preprocessing_and_model_building(taxi_date_preprocessing, binary_feature_names, categorical_feature_names, numeric_feature_names)

cross_valid(train_date_ds, model7, shards = 5, num_epochs = 20,  verb_e = 1 )

Mean MSE 2073.85244140625
Mean MAE 31.700415420532227
Mean R2 0.9678538560867309


In [29]:
test_date_ds = df_to_dictonary_dataset(test_date, target = 'trip_amount')
val_mse, val_mae, val_r2 = model7.evaluate(test_date_ds)
print("MSE", val_mse)
print("MAE", val_mae)
print("R2", val_r2)

 1/83 [..............................] - ETA: 0s - loss: 14849.6270 - mae: 109.0818 - r_square: 0.8879

MSE 83865.625
MAE 232.4762420654297
R2 0.3818899393081665


With the categorical time features included, our results have improved significantly, but as we can see our model is extremely overfitted: the cross-validated R2 of 0.97 drops to 0.38 on the held-out test set. We will now introduce the aggregated time/geodata to help alleviate this problem.

In [30]:
del model, model2, model3, model4, model5, model6, model7, vali_ds, vali_dataset, train_vali_ds, train_vali_dataset, train_prec_ds, train_prec, train_ds, train_date_ds, train_date, train_dataset
del feature_ds, normalizer, taxi_date, taxi_prec, taxi_prec_preprocessing, taxi_date_preprocessing, test_dataset, test_date, test_prec

## Applying our findings to time buckets & geodata

We will now start to use the aggregated datasets (time buckets and geodata). We will first start with hourly aggregation at a high geographic resolution and then compare larger time buckets and different geographic resolutions.

In [31]:
binary_feature_names = ['is_weekday', 'is_holiday', 'season_Autumn', 'season_Spring', 'season_Summer', 'season_Winter', 'precip']
categorical_feature_names = []
numeric_feature_names = ['temp_z', 'windspeed_z', 'temp_z_ma_7d', 'temp_z_std_7d', 'weekday_sin', 'weekday_cos']

In [32]:
file_path = "./data/"
Aggregated_Census_1H = pd.read_pickle(f"{file_path}taxi_by_census_tract_1H.pkl")
Aggregated_Census_1H.head()

| pickup_census_tract | datetime | trip_demand | temp_z | precip | windspeed_z | is_holiday | is_weekday | temp_z_ma_7d | temp_z_std_7d | weekday_sin | weekday_cos | season_Autumn | season_Spring | season_Summer | season_Winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17031980000.0 | 2021-01-01 00:00:00 | 0 | -1.248857 | 0 | -0.217303 | 1 | 0 | -1.053704 | 0.181925 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 01:00:00 | 0 | -1.243863 | 0 | 0.008083 | 1 | 0 | -1.053704 | 0.181925 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 02:00:00 | 0 | -1.246859 | 0 | 0.113459 | 1 | 0 | -1.053704 | 0.181925 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 03:00:00 | 0 | -1.231879 | 0 | 0.060771 | 1 | 0 | -1.053704 | 0.181925 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 04:00:00 | 0 | -1.213902 | 0 | 0.069552 | 1 | 0 | -1.053704 | 0.181925 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |


In [33]:
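n = len(Aggregated_Census_1H)
#note: the rows appear to be ordered by census tract first and datetime second,
#so the positional slices below separate tracts rather than time periods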
train_agg = Aggregated_Census_1H[0:int(n*0.7)]
vali_agg = Aggregated_Census_1H[int(n*0.7):int(n*0.9)]
test_agg = Aggregated_Census_1H[int(n*0.9):]

train_agg_preprocessing= Aggregated_Census_1H.loc[:, Aggregated_Census_1H.columns != 'trip_demand']

model8 = do_preprocessing_and_model_building(train_agg_preprocessing, binary_feature_names, categorical_feature_names, numeric_feature_names)

train_agg_ds = df_to_dictonary_dataset(train_agg, target = 'trip_demand')
vali_agg_ds = df_to_dictonary_dataset(vali_agg, target = 'trip_demand')
test_agg_ds = df_to_dictonary_dataset(test_agg, target = 'trip_demand')

split_valid(train_agg_ds, vali_agg_ds, model8, num_epochs = 5, verb_f = 1, verb_e = 1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
MSE 0.001681785681284964
MAE 0.0015342546394094825
R2 -0.0010836124420166016


In [34]:
val_mse, val_mae, val_r2 = model8.evaluate(test_agg_ds)
print("MSE", val_mse)
print("MAE", val_mae)
print("R2", val_r2)

    1/13223 [..............................] - ETA: 5:42 - loss: 7.3649e-07 - mae: 8.5819e-04 - r_square: 0.0000e+00

MSE 0.0014454586198553443
MAE 0.001205719425342977
R2 -0.0010412931442260742


In [36]:
cross_valid_for_timedata(Aggregated_Census_1H, binary_feature_names, numeric_feature_names, splits = 5, num_epochs = 5, verb_f = 1, verb_e = 1, test = True, lr = 0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Mean MSE 0.14407277229474857
Mean MAE 0.042865036288276315
Mean R2 -0.06162843108177185

Test MSE 0.001310288324020803
Test MAE 0.008437439799308777
Test R2 -0.05730998516082764


In [37]:
file_path = "./data/"
Aggregated_Census_2H = pd.read_pickle(f"{file_path}taxi_by_census_tract_2H.pkl")
Aggregated_Census_2H.head()

| pickup_census_tract | datetime | trip_demand | temp_z | precip | windspeed_z | is_holiday | is_weekday | temp_z_ma_7d | temp_z_std_7d | weekday_sin | weekday_cos | season_Autumn | season_Spring | season_Summer | season_Winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17031980000.0 | 2021-01-01 00:00:00 | 0 | -1.246887 | 0 | -0.105108 | 1 | 0 | -1.05415 | 0.182006 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 02:00:00 | 0 | -1.239893 | 0 | 0.08753 | 1 | 0 | -1.05415 | 0.182006 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 04:00:00 | 0 | -1.203924 | 0 | 0.190466 | 1 | 0 | -1.05415 | 0.182006 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 06:00:00 | 0 | -1.135484 | 0 | 0.865433 | 1 | 0 | -1.05415 | 0.182006 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |
| 17031980000.0 | 2021-01-01 08:00:00 | 1 | -1.103012 | 0 | 1.106598 | 1 | 0 | -1.05415 | 0.182006 | -0.866025 | -0.5 | 0 | 0 | 0 | 1 |


In [39]:
n = len(Aggregated_Census_2H)
train_agg = Aggregated_Census_2H[0:int(n*0.7)]
vali_agg = Aggregated_Census_2H[int(n*0.7):int(n*0.9)]
test_agg = Aggregated_Census_2H[int(n*0.9):]

train_agg_ds = df_to_dictonary_dataset(train_agg, target = 'trip_demand')
vali_agg_ds = df_to_dictonary_dataset(vali_agg, target = 'trip_demand')
test_agg_ds = df_to_dictonary_dataset(test_agg, target = 'trip_demand')

train_agg_preprocessing= Aggregated_Census_2H.loc[:, Aggregated_Census_2H.columns != 'trip_demand']


model9 = do_preprocessing_and_model_building(train_agg_preprocessing, binary_feature_names, categorical_feature_names, numeric_feature_names)

split_valid(train_agg_ds, vali_agg_ds, model9, num_epochs = 5, verb_f = 1, verb_e = 1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
MSE 0.003525800770148635
MAE 0.0025801549199968576
R2 -0.0017242431640625


In [40]:
val_mse, val_mae, val_r2 = model9.evaluate(test_agg_ds)
print("MSE", val_mse)
print("MAE", val_mae)
print("R2", val_r2)

MSE 0.0030616074800491333
MAE 0.0019228679593652487
R2 -0.001232743263244629


Because my model simply produces very questionable results, it is being scrapped.
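One possible reading of the combination of a tiny MSE and an R2 just below zero is that the target is almost always zero and the model predicts a near-constant value. The sketch below uses made-up data (one trip in 1,000 tract-hours), not the notebook's dataset, and a hand-written r_squared helper based on the usual definition R2 = 1 - SS_res / SS_tot.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical sparse target: a single trip in 1,000 tract-hours.
y_true = np.zeros(1000)
y_true[0] = 1.0

always_zero = np.zeros_like(y_true)
always_mean = np.full_like(y_true, y_true.mean())

for name, pred in [("always zero", always_zero), ("always mean", always_mean)]:
    mse = np.mean((y_true - pred) ** 2)
    print(f"{name}: MSE={mse:.6f} R2={r_squared(y_true, pred):.6f}")
```

Both constant predictors reach an MSE of about 0.001 while explaining none of the variance, which is the same pattern as in the evaluation outputs above. Whether this is what actually happens in the notebook would need to be checked against the model's predictions.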

In [None]:
#compare with SVM