# Initial Modeling

_By [Michael Rosenberg](mailto:rosenberg.michael.m@gmail.com)._

_**Description**: Contains my initial modeling techniques based on the intuition developed from my [EDA](eda.ipynb)._

_Last Updated: 9/5/2017 7:01 PM._

In [60]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns
import pickle as pkl
from sklearn import preprocessing as pp
from sklearn import linear_model as lm

#helpers
%matplotlib inline
sns.set_style("whitegrid")
sigLev = 3
percentLev = 100
alphaLev = .2
numBins = 30
pd.set_option("display.precision",sigLev)

In [2]:
#load in data
trainFrame = pd.read_csv("../data/raw/train.csv")
testFrame = pd.read_csv("../data/raw/test.csv")

In [3]:
#some helpers
def exportPredictions(testFrame,predictorVars,model,predictionName,
                      predictingLog = False):
    #helper for exporting our predictions
    predictionMat = testFrame[predictorVars]
    testFrame["trip_duration"] = model.predict(predictionMat)
    if (predictingLog): #need to exponentiate
        testFrame["trip_duration"] = np.exp(testFrame["trip_duration"])
    #then export
    exportFrame = testFrame[["id","trip_duration"]]
    if (".csv" not in predictionName):
        predictionName += ".csv" #just to keep consistency
    exportFrame.to_csv(predictionName,index = False)
    
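def exportModel(model,modelName):
    #helper for exporting our model
    with open(modelName,"wb") as modelFile:
        pkl.dump(model,modelFile)
    
def rmsle(predictions,actuals):
    #helper for calculating RMSLE (root mean squared logarithmic error)
    logDiffs = np.log1p(predictions) - np.log1p(actuals)
    return np.sqrt(np.mean(logDiffs ** 2))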

# Day-Intensive Model

While our EDA would suggest otherwise, I would argue that it makes sense for trip duration to be informed by the time of day of the trip. In particular, on a certain day of the week and at a certain hour, we should expect a certain level of traffic within the city. Let's start with a main effects model, and then fit a set of interaction effects. For our initial model types, we will just consider linear models. Based on our initial analysis, we will predict $\log(Trip Duration)$ and then exponentiate our predictions.
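To make the case for predicting in log space concrete, here is a minimal sketch with made-up durations. It assumes the usual $\log(1 + x)$ form of RMSLE and compares the best constant prediction in raw space (the arithmetic mean) with the best constant in log space (the exponentiated mean of the logs). The numbers are hypothetical and only illustrate the idea.

```python
import numpy as np

def rmsle(predictions, actuals):
    # root mean squared logarithmic error, log(1 + x) form (assumed)
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return np.sqrt(np.mean((np.log1p(predictions) - np.log1p(actuals)) ** 2))

# hypothetical trip durations in seconds, skewed right like taxi trips
durations = np.array([60.0, 300.0, 600.0, 3600.0])

# best constant in raw space: the arithmetic mean
rawConstant = durations.mean()
# best constant in log space: exp(mean(log(y))), i.e. the geometric mean
logConstant = np.exp(np.log(durations).mean())

rawScore = rmsle(np.full_like(durations, rawConstant), durations)
logScore = rmsle(np.full_like(durations, logConstant), durations)
print(rawScore, logScore)
```

The log-space constant scores lower because the long trip drags the arithmetic mean far above the typical duration, and RMSLE penalizes that ratio error on every short trip.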

In [4]:
#get day of week and hour
trainFrame["pickup_datetime"] = pd.to_datetime(trainFrame["pickup_datetime"])
trainFrame["pickup_dow"] = trainFrame["pickup_datetime"].dt.dayofweek
trainFrame["pickup_hour"] = trainFrame["pickup_datetime"].dt.hour

In [5]:
trainFrame["logTripDuration"] = np.log(trainFrame["trip_duration"])

In [6]:
#get main effects
#day of week
dowEncoder = pp.OneHotEncoder()
trainDOWMat = np.array(trainFrame["pickup_dow"]).reshape(-1,1)
trainDOWMat = dowEncoder.fit_transform(trainDOWMat)
#hour
hourEncoder = pp.OneHotEncoder()
trainHourMat = np.array(trainFrame["pickup_hour"]).reshape(-1,1)
trainHourMat = hourEncoder.fit_transform(trainHourMat)

In [7]:
#then get our feature matrix
trainFeatureMat = sp.sparse.hstack((trainDOWMat,trainHourMat))
#then fit linear regression
initLinReg = lm.LinearRegression()
initLinReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [8]:
#then get the same information for test frame
testFrame["pickup_datetime"] = pd.to_datetime(testFrame["pickup_datetime"])
testFrame["pickup_dow"] = testFrame["pickup_datetime"].dt.dayofweek
testFrame["pickup_hour"] = testFrame["pickup_datetime"].dt.hour

In [9]:
#get main effects, reusing the encoders fit on the training data
#so that the test columns line up with the training columns
#day of week
testDOWMat = np.array(testFrame["pickup_dow"]).reshape(-1,1)
testDOWMat = dowEncoder.transform(testDOWMat)
#hour
testHourMat = np.array(testFrame["pickup_hour"]).reshape(-1,1)
testHourMat = hourEncoder.transform(testHourMat)

In [10]:
testFeatureMat = sp.sparse.hstack((testDOWMat,testHourMat))
testFrame["logPredictions"] = initLinReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [11]:
#then export information
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv("../data/processed/dowHourPredictions.csv",index = False)

This gets us in at an RMSLE of around $0.79$. Not bad for a first try! Let's see if we can do any better once we throw our DOW-Hour interactions into the pot.
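As a small illustration of what a DOW-Hour interaction looks like, the sketch below builds the interaction columns two ways on dense toy data: by scaling the hour dummies with each day's dummy column, and by one-hot encoding the combined (day, hour) pair directly. The pickups are made up and the dense `np.eye` encoding is only for readability; sparse matrices are the better fit at full scale.

```python
import numpy as np

numDays, numHours = 7, 24
# hypothetical pickups: day of week and hour for four trips
dow = np.array([0, 3, 6, 3])
hour = np.array([8, 17, 23, 17])

# dense one-hot main effects
dowMat = np.eye(numDays)[dow]
hourMat = np.eye(numHours)[hour]

# loop construction: scale the hour dummies by each day's dummy column
loopInteractions = np.hstack([dowMat[:, [i]] * hourMat
                              for i in range(numDays)])

# direct construction: one column per (day, hour) pair
directInteractions = np.eye(numDays * numHours)[dow * numHours + hour]

print(loopInteractions.shape)
print(np.array_equal(loopInteractions, directInteractions))
```

Both constructions give one column per day-hour pair, so the interaction model can learn a separate offset for, say, Thursday at 5 PM.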

In [12]:
for i in range(len(trainFrame["pickup_dow"].unique())):
    #get particular day's dummy encoding
    givenDOWDummy = sp.sparse.diags(np.squeeze(trainDOWMat[:,i].toarray()))
    givenDOWHourInteractions = givenDOWDummy * trainHourMat
    #then add to our feature matrix
    trainFeatureMat = sp.sparse.hstack((trainFeatureMat,
                                        givenDOWHourInteractions))

In [13]:
#then fit
intLinReg = lm.LinearRegression()
intLinReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [14]:
#do the same for test
for i in range(len(testFrame["pickup_dow"].unique())):
    #get particular day's dummy encoding
    givenDOWDummy = sp.sparse.diags(np.squeeze(testDOWMat[:,i].toarray()))
    givenDOWHourInteractions = givenDOWDummy * testHourMat
    #then add to our feature matrix
    testFeatureMat = sp.sparse.hstack((testFeatureMat,
                                        givenDOWHourInteractions))

In [15]:
testFrame["logPredictions"] = intLinReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [16]:
#then export information
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv("../data/processed/dowHourInteractionPredictions.csv",
                   index = False)

This barely improves our performance. Let's see if we do any better when we predict outside the log space.

In [17]:
nonLogLinReg = lm.LinearRegression()
nonLogLinReg.fit(trainFeatureMat,trainFrame["trip_duration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [18]:
testFrame["trip_duration"] = nonLogLinReg.predict(testFeatureMat)

In [19]:
#then export information
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv("../data/processed/dowHourInteractionPredictions_nonLog.csv",
                   index = False)

As I had expected, we do much better when we predict in log space and then exponentiate back into the non-logged space. Just goes to show!

# Add Number of Passengers Encoding

As discussed before, observations with $0$ passengers have a much lower distribution of $\log(TripDuration)$ than observations with more than $0$ passengers. Let's add an indicator for this to the model!

In [20]:
trainFrame["moreThan0Passengers"] = 0
trainFrame.loc[trainFrame["passenger_count"] > 0,
               "moreThan0Passengers"] = 1
testFrame["moreThan0Passengers"] = 0
testFrame.loc[testFrame["passenger_count"] > 0,
               "moreThan0Passengers"] = 1

In [21]:
trainPassengerFeatureMat = sp.sparse.csc_matrix(
                                            trainFrame["moreThan0Passengers"]).T
testPassengerFeatureMat = sp.sparse.csc_matrix(
                                            testFrame["moreThan0Passengers"]).T
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainPassengerFeatureMat))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testPassengerFeatureMat))

In [22]:
linRegWithPassenger = lm.LinearRegression()
linRegWithPassenger.fit(trainFeatureMat,trainFrame["logTripDuration"])
testFrame["logPredictions"] = linRegWithPassenger.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [23]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv("../data/processed/dhInteractionWithPassPredictions.csv",
                   index = False)

A slight improvement, but only by so much.

# Seasonality

Let's introduce a notion of seasonality into our problem by adding the pickup month to our feature matrix.

In [24]:
trainFrame["pickup_month"] = trainFrame["pickup_datetime"].dt.month
testFrame["pickup_month"] = testFrame["pickup_datetime"].dt.month
#get encoding
monthEncoder = pp.OneHotEncoder()
trainMonthMat = monthEncoder.fit_transform(
                            np.array(trainFrame["pickup_month"]).reshape(-1,1))
testMonthMat = monthEncoder.transform(
                            np.array(testFrame["pickup_month"]).reshape(-1,1))
#then append
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainMonthMat))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testMonthMat))
#then get a linear regression
newReg = lm.LinearRegression()
newReg.fit(trainFeatureMat,trainFrame["logTripDuration"])
#then predict
testFrame["logPredictions"] = newReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [25]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/dhInteractionWithPassAndSeasonPredictions.csv",
                   index = False)

That shot us up a place in the leaderboard! Good job.

# Get Location

One thing I am interested in is creating a measure of location. We can do this by creating tile encodings of location on our map of longitude and latitude of pickup points. Let's try to create this tile encoding.
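The idea of a tile encoding can be sketched without looping over tiles: each point's tile is found by integer division of its offset from the grid origin. This is a hypothetical helper (`tileIndex` is not part of the notebook), using a toy 2-by-2 grid and half-open tiles so that every point falls in at most one tile.

```python
import numpy as np

def tileIndex(lat, lon, minLat, maxLat, minLong, maxLong, step):
    # hypothetical helper: returns a flat tile id per point, -1 if off grid
    numLat = int(round((maxLat - minLat) / step))
    numLong = int(round((maxLong - minLong) / step))
    # row and column of the tile containing each point
    i = np.floor((np.asarray(lat) - minLat) / step).astype(int)
    j = np.floor((np.asarray(lon) - minLong) / step).astype(int)
    inside = (i >= 0) & (i < numLat) & (j >= 0) & (j < numLong)
    # flatten (row, column) into one tile id
    return np.where(inside, i * numLong + j, -1)

# made-up points: two inside the grid, one north of it
lats = np.array([40.25, 40.75, 42.0])
lons = np.array([-74.75, -74.25, -74.25])
tiles = tileIndex(lats, lons, 40.0, 41.0, -75.0, -74.0, 0.5)
print(tiles)
```

A one-hot encoding of these tile ids gives the same kind of indicator columns as a per-tile loop, at the cost of one pass over the data instead of one pass per tile.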

In [26]:
def generateTileEncoding(featureFrame,locType,minLocLat,
                         maxLocLat,minLocLong,maxLocLong,step):
    #helper for generating our tile encoding for a given location type
    locLat = locType + "_latitude"
    locLong = locType + "_longitude"
    #get min and max for both
    #need trainframe so as to standardize the search on both parameter sets
    #then form latitude and longitude ranges
    latRange = np.arange(minLocLat,maxLocLat,step)
    longRange = np.arange(minLocLong,maxLocLong,step)
    #then get our matrix
    tileEncodingMat =  np.zeros((featureFrame.shape[0],len(latRange)*
                                                      len(longRange)))
    #then step through our ranges
    for i in range(len(latRange)):
        for j in range(len(longRange)):
            #get our box
            lat, lon = latRange[i], longRange[j]
            x0, y0, x1, y1 = lon, lat, lon + step, lat + step
            #form our tile encoding; tiles are half-open so that a point
            #on a shared edge is assigned to exactly one tile
            condition = ((x0 <= featureFrame[locLong]) &
                         (featureFrame[locLong] < x1) &
                         (y0 <= featureFrame[locLat]) &
                         (featureFrame[locLat] < y1))
            tileEncoding = list(condition.astype("int"))
            tileEncodingMat[:,(i * len(longRange) + j)] = tileEncoding
    #then filter out 0 variance observations
    #tileEncodingMat = tileEncodingMat[:,(tileEncodingMat.sum(axis = 0) > 0)]
    return tileEncodingMat

trainPickupEncodingMat = generateTileEncoding(trainFrame,"pickup",
                                             40.2,41.8,-75,-72.5,.1)
trainDropoffEncodingMat = generateTileEncoding(trainFrame,"dropoff",
                                             40.2,41.8,-75,-72.5,.1)

In [27]:
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,
                                sp.sparse.csr_matrix(trainPickupEncodingMat)))
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,
                                sp.sparse.csr_matrix(trainDropoffEncodingMat)))

In [28]:
tileReg = lm.LinearRegression()
tileReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [29]:
testPickupEncodingMat = generateTileEncoding(testFrame,"pickup",
                                             40.2,41.8,-75,-72.5,.1)
testDropoffEncodingMat = generateTileEncoding(testFrame,"dropoff",
                                             40.2,41.8,-75,-72.5,.1)

In [30]:
testFeatureMat = sp.sparse.hstack((testFeatureMat,
                                sp.sparse.csr_matrix(testPickupEncodingMat)))
testFeatureMat = sp.sparse.hstack((testFeatureMat,
                                sp.sparse.csr_matrix(testDropoffEncodingMat)))

In [31]:
testFrame["logPredictions"] = tileReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [32]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/tileEncodingSectoredPredictions.csv",
                   index = False)

Looks like that helped out a bit! Let's see what happens when we introduce a notion of distance into our model.
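For reference, the haversine formula gives the great-circle distance between two (latitude, longitude) points. Below is a minimal vectorized sketch in NumPy; `haversineKm` is a hypothetical helper and the Earth radius of 6371 km is an assumption, so results may differ slightly from a library implementation.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversineKm(lat1, lon1, lat2, lon2):
    # convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = [np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2)]
    dLat = lat2 - lat1
    dLon = lon2 - lon1
    # haversine of the central angle between the two points
    a = (np.sin(dLat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# made-up pairs: one degree of latitude apart, and an identical pair
distances = haversineKm([0.0, 40.75], [0.0, -73.99],
                        [1.0, 40.75], [0.0, -73.99])
print(distances)
```

Because it works on whole arrays, a function like this can take the four coordinate columns directly rather than iterating over rows.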

In [33]:
from haversine import haversine
#note: haversine expects each point as (latitude, longitude)
trainFrame["distance"] = [haversine((x[1]["pickup_latitude"],
                                       x[1]["pickup_longitude"]),
                                      (x[1]["dropoff_latitude"],
                                       x[1]["dropoff_longitude"])) for x in
                          trainFrame.iterrows()]

testFrame["distance"] = [haversine((x[1]["pickup_latitude"],
                                       x[1]["pickup_longitude"]),
                                      (x[1]["dropoff_latitude"],
                                       x[1]["dropoff_longitude"])) for x in
                        testFrame.iterrows()]

In [34]:
trainDistanceMat = sp.sparse.csc_matrix(np.array(trainFrame["distance"])).T
testDistanceMat = sp.sparse.csc_matrix(np.array(testFrame["distance"])).T
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainDistanceMat))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testDistanceMat))

In [35]:
#then fit
distanceMod = lm.LinearRegression()
distanceMod.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [36]:
testFrame["logPredictions"] = distanceMod.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [37]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/tileEncodingWithDistancePredictions.csv",
                   index = False)

Adding that got us way ahead! Let's see if the interaction between distance and time of day has any impact.
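To see what a distance-by-hour interaction buys us, here is a toy sketch: scaling each row of the hour dummies by that row's distance gives one distance column per hour, so a linear model can learn a different slope for each hour. The data are made up, with three "hours" standing in for the full 24.

```python
import numpy as np

numHours = 3  # tiny stand-in for the 24 hourly dummies
hour = np.array([0, 1, 2, 0, 1, 2])
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
hourMat = np.eye(numHours)[hour]

# scale each row of the hour dummies by that row's distance;
# this matches multiplying by a diagonal matrix of distances
distHourInteractions = hourMat * distance[:, None]
diagVersion = np.dot(np.diag(distance), hourMat)

# made-up target whose distance slope differs by hour
trueSlopes = np.array([0.5, 1.0, 2.0])
target = trueSlopes[hour] * distance

# least squares on the interaction columns recovers one slope per hour
fittedSlopes = np.linalg.lstsq(distHourInteractions, target, rcond=None)[0]
print(fittedSlopes)
```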

In [38]:
trainDistanceDiag = sp.sparse.diags(np.array(trainFrame["distance"]))
testDistanceDiag = sp.sparse.diags(np.array(testFrame["distance"]))

In [39]:
trainDistHourInteractions = trainDistanceDiag * trainHourMat
testDistHourInteractions = testDistanceDiag * testHourMat

In [40]:
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainDistHourInteractions))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testDistHourInteractions))

In [41]:
distanceIntMod = lm.LinearRegression()
distanceIntMod.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [42]:
testFrame["logPredictions"] = distanceIntMod.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [43]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/tileEncodingWithDistanceIntPredictions.csv",
                   index = False)

# Regularization

Now that the model has gotten quite big:

In [44]:
trainFeatureMat

<1458644x1031 sparse matrix of type '<type 'numpy.float64'>'
	with 13115873 stored elements in COOrdinate format>

We should consider some form of regularization. Starting with an $L_2$ regularizer seems reasonable given that we have $1031$ features.
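As a reminder of what the $L_2$ penalty does, the sketch below solves ridge regression in closed form, $w = (X^TX + \alpha I)^{-1}X^Ty$, on synthetic data and compares coefficient norms with and without the penalty. It is a simplified illustration: it has no intercept, whereas a library implementation typically fits an unpenalized intercept, and `ridgeWeights` is a hypothetical helper.

```python
import numpy as np

# synthetic regression problem with known weights plus a little noise
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
trueWeights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X.dot(trueWeights) + rng.normal(scale=0.1, size=50)

def ridgeWeights(X, y, alpha):
    # closed-form minimizer of ||Xw - y||^2 + alpha * ||w||^2
    numFeatures = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + alpha * np.eye(numFeatures),
                           X.T.dot(y))

olsWeights = ridgeWeights(X, y, 0.0)      # alpha = 0 is ordinary least squares
shrunkWeights = ridgeWeights(X, y, 10.0)  # larger alpha shrinks the weights
print(np.linalg.norm(olsWeights), np.linalg.norm(shrunkWeights))
```

The penalized weights always have a smaller norm, which trades a little bias for less variance when there are many correlated indicator columns.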

In [45]:
ridgeReg = lm.Ridge()
ridgeReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [46]:
testFrame["logPredictions"] = ridgeReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [47]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/ridgeRegressionPredictions.csv",
                   index = False)

We perform slightly worse in the context of the ridge regression. We can do better! Let's try some other variable transformations.

# Other Distance Metrics

Having observed some other [Kaggle Kernels](https://www.kaggle.com/priyanka13/nyc-using-xgboost-0-41), I thought to include some other relevant forms of distance. In particular, I want to include [Vincenty Distance](https://en.wikipedia.org/wiki/Vincenty's_formulae) and [Great Circle Distance](https://en.wikipedia.org/wiki/Great-circle_distance) within our model.

In [44]:
from geopy.distance import vincenty, great_circle
trainFrame["vincentyDistance"] = [vincenty((x[1]['pickup_latitude'],
                                           x[1]['pickup_longitude']),
            (x[1]['dropoff_latitude'], x[1]['dropoff_longitude'])).miles for
                    x in trainFrame.iterrows()]

In [45]:
#vincenty distance
testFrame["vincentyDistance"] = [vincenty((x[1]['pickup_latitude'],
                                           x[1]['pickup_longitude']),
            (x[1]['dropoff_latitude'], x[1]['dropoff_longitude'])).miles for
                    x in testFrame.iterrows()]
#great circle distance
trainFrame["gcDistance"] = [great_circle((x[1]['pickup_latitude'],
                                           x[1]['pickup_longitude']),
            (x[1]['dropoff_latitude'], x[1]['dropoff_longitude'])).miles for
                    x in trainFrame.iterrows()]
testFrame["gcDistance"] = [great_circle((x[1]['pickup_latitude'],
                                           x[1]['pickup_longitude']),
            (x[1]['dropoff_latitude'], x[1]['dropoff_longitude'])).miles for
                    x in testFrame.iterrows()]

In [46]:
trainNewDistanceMat = np.array(trainFrame[["vincentyDistance","gcDistance"]])
trainNewDistanceMat = sp.sparse.csc_matrix(trainNewDistanceMat)
testNewDistanceMat = np.array(testFrame[["vincentyDistance","gcDistance"]])
testNewDistanceMat = sp.sparse.csc_matrix(testNewDistanceMat)
#then add them in
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainNewDistanceMat))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testNewDistanceMat))

In [47]:
newDistanceMod = lm.LinearRegression()
newDistanceMod.fit(trainFeatureMat,trainFrame["logTripDuration"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [48]:
testFrame["logPredictions"] = newDistanceMod.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [49]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/newDistancePredictions.csv",
                   index = False)

Let's see how well these perform when we add hour interactions.

# Other Model Types

Let's see if a decision tree might work better in this context.

In [None]:
from sklearn import tree
treeReg = tree.DecisionTreeRegressor()
treeReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

In [None]:
testFrame["logPredictions"] = treeReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [None]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/treeRegPredictions.csv",
                   index = False)

That fits slowly. Let's try a neural net! Because of course we have to try neural nets.

In [54]:
from sklearn.neural_network import MLPRegressor

In [56]:
nnReg = MLPRegressor()
nnReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [57]:
testFrame["logPredictions"] = nnReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [58]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/neuralNetRegPredictions.csv",
                   index = False)

That put us in the top $63\%$! Let's check out support vector regression.

In [None]:
from sklearn import svm
svReg = svm.SVR()
svReg.fit(trainFeatureMat,trainFrame["logTripDuration"])

In [None]:
testFrame["logPredictions"] = svReg.predict(testFeatureMat)
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [None]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/supportVectorRegPredictions.csv",
                   index = False)

In [50]:
from keras.models import Sequential
from keras.layers import Dense

Using Theano backend.


In [51]:
newNN = Sequential()
newNN.add(Dense(100,input_dim = trainFeatureMat.shape[1],
                activation = "relu"))
newNN.add(Dense(50,activation = "sigmoid"))
newNN.add(Dense(1,activation = "linear"))

In [53]:
#accuracy is a classification metric, so we only track the loss here
newNN.compile(loss = "mean_squared_error",optimizer = "adam")

In [56]:
newNN.fit(trainFeatureMat.toarray(),np.array(trainFrame["logTripDuration"]),
          epochs = 2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x120a31f10>

In [58]:
testFrame["logPredictions"] = newNN.predict(testFeatureMat.toarray())
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [59]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/kerasRegPredictions.csv",
                   index = False)

That did really well! Let's try another neural net with an additional layer.

In [61]:
newNN = Sequential()
newNN.add(Dense(100,input_dim = trainFeatureMat.shape[1],
                activation = "relu"))
newNN.add(Dense(50,activation = "sigmoid"))
newNN.add(Dense(25,activation = "elu"))
newNN.add(Dense(1,activation = "linear"))

In [62]:
#accuracy is a classification metric, so we only track the loss here
newNN.compile(loss = "mean_squared_error",optimizer = "adam")

In [63]:
newNN.fit(trainFeatureMat.toarray(),np.array(trainFrame["logTripDuration"]),
          epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x12111d1d0>

In [64]:
testFrame["logPredictions"] = newNN.predict(testFeatureMat.toarray())
testFrame["trip_duration"] = np.exp(testFrame["logPredictions"])

In [65]:
exportFrame = testFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/kerasSecondRegPredictions.csv",
                   index = False)

# Introduce OSRM Data

Let's see how well we can perform when we introduce OSRM data from [here](https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm). This will allow us to include information on the fastest route, including its travel time, number of steps, and total distance.

In [51]:
trainFastestRouteFrame_p1 = pd.read_csv(
                    "../data/preprocessed/osrm/fastest_routes_train_part_1.csv")
trainFastestRouteFrame_p2 = pd.read_csv(
                    "../data/preprocessed/osrm/fastest_routes_train_part_2.csv")
testFastestRouteFrame = pd.read_csv(
                    "../data/preprocessed/osrm/fastest_routes_test.csv")

In [52]:
trainFastestRouteFrame = pd.concat([trainFastestRouteFrame_p1,
                                    trainFastestRouteFrame_p2])

In [53]:
trainFrame = trainFrame.merge(trainFastestRouteFrame,on = "id",how = "left")
testFrame = testFrame.merge(testFastestRouteFrame,on = "id",how = "left")

Let's ensure there are no `NaN` features when adding these two variables.
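The pattern here is a left join followed by a row filter that must be applied to both the frame and the already-built feature matrix. A minimal sketch on made-up frames (the `trips`, `routes`, and `features` names are hypothetical) shows one way to keep the two aligned with a single boolean mask:

```python
import numpy as np
import pandas as pd

# hypothetical trips and a route table that is missing trip "b"
trips = pd.DataFrame({"id": ["a", "b", "c"],
                      "distance": [1.0, 2.0, 3.0]})
routes = pd.DataFrame({"id": ["a", "c"],
                       "total_distance": [1200.0, 3100.0]})

# the left join keeps every trip; unmatched trips get NaN route columns
merged = trips.merge(routes, on="id", how="left")

# stand-in for a feature matrix with one row per trip
features = np.arange(6, dtype=float).reshape(3, 2)

# one boolean mask filters the frame and the feature rows together
keep = merged["total_distance"].notnull().values
filteredFrame = merged[keep].copy()
filteredFeatures = features[keep]
print(filteredFrame)
print(filteredFeatures)
```

Using the same mask for both objects avoids relying on the frame's index labels matching row positions in the matrix.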

In [54]:
trainCondition = ((trainFrame["total_distance"].notnull()) &
                  (trainFrame["total_travel_time"].notnull()) &
                  (trainFrame["number_of_steps"].notnull()))
testConditon = ((testFrame["total_distance"].notnull()) &
                  (testFrame["total_travel_time"].notnull()) &
                  (testFrame["number_of_steps"].notnull()))

In [55]:
#copy so that later column assignments do not act on a slice of the original
filteredTrainFrame = trainFrame[trainCondition].copy()
filteredTestFrame = testFrame[testConditon].copy()

In [56]:
#get remainder indices
trainRemainderIndices = list(filteredTrainFrame.index)
testRemainderIndices = list(filteredTestFrame.index)

In [57]:
#then filter variables
trainFeatureMat = trainFeatureMat.tocsc()[trainRemainderIndices,:]
testFeatureMat = testFeatureMat.tocsc()[testRemainderIndices,:]

In [58]:
#then pull information into sparse matrices
osrmVars = ["total_distance","total_travel_time","number_of_steps"]
trainOSRMMat = sp.sparse.csc_matrix(filteredTrainFrame[osrmVars])
testOSRMMat = sp.sparse.csc_matrix(filteredTestFrame[osrmVars])

In [59]:
#then stack them
trainFeatureMat = sp.sparse.hstack((trainFeatureMat,trainOSRMMat))
testFeatureMat = sp.sparse.hstack((testFeatureMat,testOSRMMat))

In [61]:
#then build a new neural net
newNN = Sequential()
newNN.add(Dense(100,input_dim = trainFeatureMat.shape[1],
                activation = "relu"))
newNN.add(Dense(50,activation = "relu"))
newNN.add(Dense(25,activation = "elu"))
newNN.add(Dense(1,activation = "linear"))

In [62]:
#accuracy is a classification metric, so we only track the loss here
newNN.compile(loss = "mean_squared_error",optimizer = "adam")

In [63]:
newNN.fit(trainFeatureMat.toarray(),
          np.array(filteredTrainFrame["logTripDuration"]),
          epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x4c24b5810>

In [64]:
filteredTestFrame["logPredictions"] = newNN.predict(testFeatureMat.toarray())
filteredTestFrame["trip_duration"] = np.exp(
                                filteredTestFrame["logPredictions"])

In [65]:
exportFrame = filteredTestFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/osrmReg2Predictions.csv",
                   index = False)

That didn't do any better. Let's see if we can do any better by changing up some of our activation functions.
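For reference, the three hidden-layer activations used in these networks can be written in a few lines of NumPy. This is a sketch of the standard definitions, with the ELU scale `alpha = 1.0` assumed as the default.

```python
import numpy as np

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(x, 0.0)

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, alpha=1.0):
    # identity for positive inputs, smooth saturation toward -alpha otherwise
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

inputs = np.array([-2.0, 0.0, 2.0])
print(relu(inputs))
print(sigmoid(inputs))
print(elu(inputs))
```

The practical difference is in how negative pre-activations are treated: ReLU discards them, ELU keeps a bounded negative signal, and sigmoid compresses everything into a narrow range.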

In [80]:
#then build a new neural net
newNN = Sequential()
newNN.add(Dense(100,input_dim = trainFeatureMat.shape[1],
                activation = "relu"))
newNN.add(Dense(50,activation = "sigmoid"))
newNN.add(Dense(25,activation = "elu"))
newNN.add(Dense(10,activation = "relu"))
newNN.add(Dense(1,activation = "linear"))

In [81]:
#accuracy is a classification metric, so we only track the loss here
newNN.compile(loss = "mean_squared_error",optimizer = "adam")

In [None]:
newNN.fit(trainFeatureMat.toarray(),
          np.array(filteredTrainFrame["logTripDuration"]),
          epochs = 3)

In [78]:
filteredTestFrame["logPredictions"] = newNN.predict(testFeatureMat.toarray())
filteredTestFrame["trip_duration"] = np.exp(
                                filteredTestFrame["logPredictions"])

In [79]:
exportFrame = filteredTestFrame[["id","trip_duration"]]
exportFrame.to_csv(
            "../data/processed/osrmMoreLayerRegPredictions.csv",
                   index = False)