# New York City Taxi Fare Prediction - Simple Linear Model
This notebook is a basic starter for the New York City Taxi Fare Prediction competition.
We fit a simple linear model that predicts the fare_amount of each ride from the travel vector between the taxi's pickup and dropoff locations.

This kernel uses some pandas and mostly numpy for the critical work. There are many higher-level libraries you could use instead, for example sklearn or statsmodels.

In [1]:
# Initial Python environment setup...
import numpy as np # linear algebra
import pandas as pd # CSV file I/O (e.g. pd.read_csv)
import os # reading the input files we have access to

## Setup training data
First let's read in our training data. Kernels do not yet support enough memory to load the whole dataset at once, at least using pd.read_csv. The entire dataset is about 55M rows, so we're working with only a small subset here (195,767 rows, as the shape check further down shows), but it's certainly possible to build a model using all the data.
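
When memory is the constraint, `pd.read_csv` can be told to parse less. The sketch below is only an illustration: `csv_text` is made-up stand-in data that borrows the training file's column names, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up stand-in for the training CSV (hypothetical values, not real rides).
csv_text = """key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
a,4.5,-73.84,40.72,-73.84,40.71,1
b,16.9,-74.01,40.71,-73.97,40.78,1
c,5.7,-73.98,40.76,-73.99,40.75,2
d,7.7,-73.98,40.73,-73.99,40.75,1
"""

# nrows caps how many rows are parsed; usecols skips columns we do not need.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    nrows=3,
    usecols=['fare_amount', 'pickup_longitude', 'pickup_latitude',
             'dropoff_longitude', 'dropoff_latitude'],
)
print(sample_df.shape)
```

Both options reduce what is held in memory; which rows and columns are worth keeping depends on the features you plan to build.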

In [2]:
train_df = pd.read_csv('E:/NYC_Fare/train1.csv')
train_df.dtypes

key                   object
fare_amount          float64
pickup_datetime       object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
distance_miles       float64
dtype: object

Let's create two new features in our training set representing the "travel vector" between the start and end points of the taxi ride, in both longitude and latitude coordinates. We'll take the absolute value since we're only interested in distance traveled. Use a helper function since we'll want to do the same thing for the test set later.

In [3]:
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)
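
To see what the helper produces, here is a small sketch on two made-up rides. The coordinates and the name `toy_df` are hypothetical; the helper is repeated so the snippet runs on its own.

```python
import pandas as pd

def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

# Two made-up rides: one heading east/north, one heading west/south.
toy_df = pd.DataFrame({
    'pickup_longitude':  [-74.00, -73.95],
    'pickup_latitude':   [40.70, 40.80],
    'dropoff_longitude': [-73.98, -74.00],
    'dropoff_latitude':  [40.75, 40.78],
})

# The helper adds the two columns to toy_df in place.
add_travel_vector_features(toy_df)
print(toy_df[['abs_diff_longitude', 'abs_diff_latitude']].round(2))
```

Because of the absolute value, the direction of travel is discarded: both rides end up with positive differences regardless of which way the taxi went.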

## Train our model
Our model will take the form $X \cdot w = y$, where $X$ is a matrix of input features and $y$ is a column of the target variable, fare_amount, for each row. The weight column $w$ is what we will "learn".

First let's set up our input matrix $X$ and target column $y$ from our training set. The matrix $X$ should consist of the two GPS coordinate differences, plus a third column of 1s to allow the model to learn a constant bias term. The column $y$ should consist of the target fare_amount values.

In [4]:
# Construct and return an Nx3 input matrix for our linear model
# using the travel vector, plus a 1.0 for a constant bias term.
def get_input_matrix(df):
    return np.column_stack((df.abs_diff_longitude, df.abs_diff_latitude, np.ones(len(df))))

train_X = get_input_matrix(train_df)
train_y = np.array(train_df['fare_amount'])

print(train_X.shape)
print(train_y.shape)

(195767, 3)
(195767,)


In [5]:
# The lstsq function returns several things, and we only care about the actual weight vector w.
(w, _, _, _) = np.linalg.lstsq(train_X, train_y, rcond = None)
print(w)

[164.4579912  112.90727954   5.15986424]


These weights pass a quick sanity check, since we'd expect the first two values -- the weights for the absolute longitude and latitude differences -- to be positive, as more distance should imply a higher fare, and we'd expect the bias term to loosely represent the cost of a very short ride.
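
As a worked example of what these weights mean, the sketch below prices a single hypothetical ride by hand. The ride's coordinate differences are made up; the weights are the ones printed above.

```python
import numpy as np

# Weights printed above: [longitude weight, latitude weight, bias].
w = np.array([164.4579912, 112.90727954, 5.15986424])

# Hypothetical ride: 0.01 degrees of longitude and 0.01 degrees of latitude.
# The trailing 1.0 multiplies the bias term.
ride = np.array([0.01, 0.01, 1.0])

predicted_fare = ride @ w
print(round(predicted_fare, 2))
```

This comes to roughly $7.93: about $1.64 from the longitude difference, $1.13 from the latitude difference, and the $5.16 bias. A ride with zero displacement would be priced at the bias alone.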

Sidenote: we can also calculate the weight column $w$ directly from the closed-form Ordinary Least Squares solution (the normal equations):
$w = (X^T \cdot X)^{-1} \cdot X^T \cdot y$

In [6]:
# Closed-form OLS: w = (X^T X)^-1 X^T y, written with the @ matrix-multiplication operator.
w_OLS = np.linalg.inv(train_X.T @ train_X) @ train_X.T @ train_y
print(w_OLS)

[164.4579912  112.90727954   5.15986424]
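
The agreement between the two approaches can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the data, `true_w`, and all variable names are made up, and it uses `np.linalg.solve` on the normal equations rather than forming the inverse, since solving the linear system is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features plus a bias column, known weights, small noise.
X = np.column_stack((rng.uniform(0, 0.1, 500), rng.uniform(0, 0.1, 500), np.ones(500)))
true_w = np.array([150.0, 100.0, 5.0])
y = X @ true_w + rng.normal(0, 0.5, 500)

# Least-squares fit, as in the notebook.
w_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Solve (X^T X) w = X^T y directly instead of computing the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_lstsq, w_solve))
```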


## Make predictions on the test set
Now let's load up our test inputs and predict the fare_amounts for them using our learned weights!

In [7]:
test_df = pd.read_csv('E:/NYC_Fare/test1.csv')
test_df.dtypes

key                   object
pickup_datetime       object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
dtype: object

In [8]:
# Reuse the above helper functions to add our features and generate the input matrix.
add_travel_vector_features(test_df)
test_X = get_input_matrix(test_df)
# Predict fare_amount on the test set using our model (w) trained on the training set.
test_y_predictions = np.matmul(test_X, w).round(decimals = 2)

# Write the predictions to a CSV file which we can submit to the competition.
submission = pd.DataFrame(
    {'key': test_df.key, 'fare_amount': test_y_predictions},
    columns = ['key', 'fare_amount'])
submission.to_csv('E:/NYC_Fare/prediction1.csv', index = False)

## Ideas for Improvement
The output here scores an RMSE of about $6.42 against the ground-truth fares (see the check at the end of this notebook), but you can do better than that! Here are some suggestions:

- Use more columns from the input data. Here we're only using the start/end GPS points from columns [pickup|dropoff]_[latitude|longitude]. Try to see if the other columns -- pickup_datetime and passenger_count -- can help improve your results.
- Use absolute location data rather than relative. Here we're only looking at the difference between the start and end points, but maybe the actual values -- indicating where in NYC the taxi is traveling -- would be useful.
- Use a non-linear model to capture more intricacies within the data.
- Try to find outliers to prune, or construct useful feature crosses.
- Use the entire dataset -- here we're only using 195,767 rows, well under 1% of the roughly 55M available!
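
One possible way to fit on a file too large for memory is to accumulate the normal-equation terms $X^T X$ and $X^T y$ chunk by chunk, since both are sums over rows. The sketch below is an assumption-laden illustration, not this notebook's method: the CSV is made-up synthetic data that already contains the two travel-vector columns, and `xtx`, `xty`, and `w_chunked` are hypothetical names.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-in CSV; in practice this would be the full training file.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'abs_diff_longitude': rng.uniform(0, 0.1, n),
    'abs_diff_latitude': rng.uniform(0, 0.1, n),
})
toy['fare_amount'] = (150 * toy.abs_diff_longitude + 100 * toy.abs_diff_latitude
                      + 5 + rng.normal(0, 0.5, n))
csv_text = toy.to_csv(index=False)

# Accumulate X^T X (3x3) and X^T y (length 3) one chunk at a time.
xtx = np.zeros((3, 3))
xty = np.zeros(3)
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    X = np.column_stack((chunk.abs_diff_longitude, chunk.abs_diff_latitude,
                         np.ones(len(chunk))))
    y = chunk.fare_amount.to_numpy()
    xtx += X.T @ X
    xty += X.T @ y

# Solve the accumulated normal equations for the weights.
w_chunked = np.linalg.solve(xtx, xty)
print(w_chunked)
```

Only one chunk plus two small accumulators are in memory at a time, so the row count of the file stops being the limiting factor.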

In [9]:
# Load the true fares for the test rides so we can score our predictions locally.
groundtruth = pd.read_csv('E:/NYC_Fare/value1.csv')

In [10]:
print(groundtruth.shape)
print(submission.shape)

(9810, 2)
(9810, 2)


In [11]:
trueArray = np.array(groundtruth['fare_amount'])
predArray = np.array(submission['fare_amount'])
print(len(trueArray),len(predArray))

9810 9810


In [12]:
# Root mean squared error between the true and predicted fares.
rmse = np.sqrt(np.mean(np.square(trueArray-predArray)))
print(rmse)

6.416984327495839
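
The RMSE above compares the two arrays position by position, which is only valid if both files list the rides in the same order. A more defensive sketch joins on `key` first. This assumes the ground-truth file has a `key` column matching the submission's; the data and the names `truth`, `preds`, and `rmse_aligned` are made up.

```python
import numpy as np
import pandas as pd

# Made-up ground truth and predictions, deliberately in different row order.
truth = pd.DataFrame({'key': ['a', 'b', 'c'], 'fare_amount': [10.0, 6.0, 8.0]})
preds = pd.DataFrame({'key': ['c', 'a', 'b'], 'fare_amount': [9.0, 12.0, 6.0]})

# Join on key so each prediction is compared with its own ground-truth fare.
# validate='one_to_one' raises if a key is duplicated on either side.
merged = truth.merge(preds, on='key', suffixes=('_true', '_pred'),
                     validate='one_to_one')

rmse_aligned = np.sqrt(np.mean(np.square(merged.fare_amount_true - merged.fare_amount_pred)))
print(rmse_aligned)
```

If the merged frame has fewer rows than either input, some keys failed to match, which is worth checking before trusting the score.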
