# Demo 1: NYC Taxi Fare Prediction


### Basic Starter Kernel example  
https://www.kaggle.com/dster/nyc-taxi-fare-starter-kernel-simple-linear-model/notebook

### Based on: pandas, numpy, simple linear model

### use pytorch_env


In [1]:
# Initial Python environment setup...
import numpy as np # linear algebra
import pandas as pd # CSV file I/O (e.g. pd.read_csv)
import os # reading the input files we have access to
from datetime import datetime


print(os.listdir('./input'))

['GCP-Coupons-Instructions.rtf', 'train.csv.zip', 'train.csv', 'sample_submission.csv', 'test.csv']


### Setup training data
First let's read in our training data. The original starter kernel skipped a good portion of the data because Kaggle Kernels did not support enough memory to load the whole dataset at once, at least using pd.read_csv. Here we read the entire dataset, about 55M rows, which takes a couple of minutes.
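If memory is tight, `pd.read_csv` can read only part of a file. The following is a minimal sketch, not the notebook's own loading code: it uses a tiny in-memory CSV as a stand-in for `train.csv`, and the sample rows and the names `sample_text`, `head_df`, and `n_rows` are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for ./input/train.csv; these rows are invented for illustration.
sample_text = (
    "key,fare_amount,passenger_count\n"
    "a,4.5,1\n"
    "b,16.9,1\n"
    "c,5.7,2\n"
    "d,7.7,1\n"
    "e,5.3,1\n"
)

# Option 1: read only the first N rows.
head_df = pd.read_csv(io.StringIO(sample_text), nrows=3)
print(len(head_df))

# Option 2: stream the file in fixed-size chunks and aggregate as you go,
# so the whole file never has to sit in memory at once.
n_rows = 0
for chunk in pd.read_csv(io.StringIO(sample_text), chunksize=2):
    n_rows += len(chunk)
print(n_rows)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path `'./input/train.csv'`.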

In [2]:
time1=datetime.now()


train_df =  pd.read_csv('./input/train.csv')
train_df.dtypes

time2=datetime.now()
data_load_time=time2-time1
print("Data Load Consuming Time:")
print(data_load_time)
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)

Data Load Consuming Time:
0:02:21.900109


### Data Preparation
Explore and prune outliers

First let's see if there are any NaNs in the dataset.

In [3]:
print(train_df.isnull().sum())


key                     0
fare_amount             0
pickup_datetime         0
pickup_longitude        0
pickup_latitude         0
dropoff_longitude     376
dropoff_latitude      376
passenger_count         0
abs_diff_longitude    376
abs_diff_latitude     376
dtype: int64


In [4]:
time3=datetime.now()

print('Old size: %d' % len(train_df))
train_df = train_df.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(train_df))

# Further data-preparation steps are possible, but we simply skip them here.
# E.g., some data points have abs_diff_longitude > 5
# E.g., some data points are in the water -> https://www.kaggle.com/breemen/nyc-taxi-fare-data-exploration

#plot = train_df.iloc[:2000].plot.scatter('abs_diff_longitude', 'abs_diff_latitude')
#print('Old size: %d' % len(train_df))
#train_df = train_df[(train_df.abs_diff_longitude < 5.0) & (train_df.abs_diff_latitude < 5.0)]
#print('New size: %d' % len(train_df))

time4=datetime.now()
data_processing_time=time4-time3
print("Data Processing Consuming Time:")
print(data_processing_time)
data_prepare_time=data_load_time+data_processing_time
print("Data prepare Consuming Time:")
print(data_prepare_time)

Old size: 55423856
New size: 55423480
Data Processing Consuming Time:
0:00:16.825389
Data prepare Consuming Time:
0:02:38.725498


We expect most of these values to be very small (likely between 0 and 1), since they should all be differences between GPS coordinates within one city. For reference, one degree of latitude is about 69 miles. However, the dataset contains extreme values that do not make sense. The starter kernel removes them from the training set: based on its scatterplot, values above 5 can safely be excluded (though remember the scatterplot only shows the first 2000 rows). That filter is commented out in the cell above, so this run keeps those rows.
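The outlier filter described above can be sketched on a toy frame. The coordinate differences below are invented, the 5-degree threshold comes from the commented-out line in the notebook, and the 69 miles per degree is the rough figure quoted above; the names `toy_df`, `keep`, and `filtered_df` are illustrative only.

```python
import pandas as pd

# Invented coordinate differences (in degrees); the last row is an obvious outlier.
toy_df = pd.DataFrame({
    "abs_diff_longitude": [0.01, 0.03, 0.20, 73.9],
    "abs_diff_latitude": [0.02, 0.01, 0.15, 40.7],
})

MAX_DIFF_DEGREES = 5.0        # threshold taken from the commented-out filter
MILES_PER_DEGREE_LAT = 69.0   # rough figure for latitude only

# Keep rows whose longitude AND latitude differences are both below the threshold.
keep = (toy_df.abs_diff_longitude < MAX_DIFF_DEGREES) & (toy_df.abs_diff_latitude < MAX_DIFF_DEGREES)
filtered_df = toy_df[keep]

print("Old size: %d" % len(toy_df))
print("New size: %d" % len(filtered_df))

# A 5-degree latitude difference corresponds to roughly this many miles:
print(MAX_DIFF_DEGREES * MILES_PER_DEGREE_LAT)
```

A threshold of several hundred miles is far larger than any plausible taxi ride, which is why it only removes clearly broken rows.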



### Train our model
Our model will take the form  **X⋅w=y  where  X  is a matrix of input features, and  y  is a column of the target variable, fare_amount, for each row**. The weight column  w  is what we will "learn".

First let's set up our input matrix  X  and target column  y  from our training set. The matrix  X  should consist of the two GPS coordinate differences, plus a third term of 1 to allow the model to learn a constant bias term. The column  y  should consist of the target fare_amount values.

In [5]:
# Construct and return an Nx3 input matrix for our linear model
# using the travel vector, plus a 1.0 for a constant bias term.

def get_input_matrix(df):
    return np.column_stack((df.abs_diff_longitude, df.abs_diff_latitude, np.ones(len(df))))

train_X = get_input_matrix(train_df)
train_y = np.array(train_df['fare_amount'])

print(train_X.shape)
print(train_y.shape)

(55423480, 3)
(55423480,)


Now let's use numpy's lstsq (LeaST SQuares) library function to find the optimal weight column  w .

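In general, the first value returned by np.linalg.lstsq(A, b) is the x that minimizes ||A⋅x − b||. Here A is train_X, b is train_y, and the solution x is the weight column  w .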
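To see what `np.linalg.lstsq` returns, here is a small sketch on synthetic data with known weights. All values, and the names `true_w`, `features`, and `w_hat`, are invented for illustration; only the `lstsq` call itself mirrors the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known weights: two feature weights plus a bias.
true_w = np.array([2.0, 3.0, 5.0])
features = rng.uniform(0.0, 1.0, size=(1000, 2))

# Same layout as the notebook's input matrix: two features and a column of ones.
X = np.column_stack((features, np.ones(len(features))))
y = X @ true_w + rng.normal(0.0, 0.01, size=len(X))  # small additive noise

# lstsq returns (solution, residuals, rank, singular_values).
w_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
```

Because the noise is small, the recovered `w_hat` should land close to `true_w`, with the last entry playing the role of the bias term.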



In [6]:
time5=datetime.now()

# The lstsq function returns several things, and we only care about the actual weight vector w.
(w, _, _, _) = np.linalg.lstsq(train_X, train_y, rcond = None)
print(w)

time6=datetime.now()
model_train_time=time6-time5
print("Model Train Consuming Time:")
print(model_train_time)

[9.14423776e-03 4.63805262e-04 1.13431261e+01]
Model Train Consuming Time:
0:00:04.007546


In [7]:
# Closed-form OLS with numpy.matmul: w = (X^T X)^-1 X^T y
time7=datetime.now()

w_OLS = np.matmul(np.matmul(np.linalg.inv(np.matmul(train_X.T, train_X)), train_X.T), train_y)
print(w_OLS)
time8=datetime.now()
model_train_time_OLS=time8-time7
print("Model Train OLS Consuming Time:")
print(model_train_time_OLS)

[9.14423776e-03 4.63805262e-04 1.13431261e+01]
Model Train OLS Consuming Time:
0:00:01.105275


These weights pass a quick sanity check, since we'd expect the first two values -- the weights for the absolute longitude and latitude differences -- to be positive, as more distance should imply a higher fare, and we'd expect the bias term to loosely represent the cost of a very short ride.

Sidenote: we can actually calculate the weight column  w  directly using the Ordinary Least Squares method:  $w = (X^T \cdot X)^{-1} \cdot X^T \cdot y$
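As a hedged variation on the closed-form formula, the sketch below solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which is usually considered the more numerically stable choice. This is a swapped-in technique, not what the notebook cell does, and the synthetic data and names `w_normal` and `w_lstsq` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic inputs: two random features plus a column of ones for the bias.
X = np.column_stack((rng.uniform(0.0, 1.0, size=(500, 2)), np.ones(500)))
y = rng.normal(10.0, 2.0, size=500)

# Normal equations: (X^T X) w = X^T y, solved without computing the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from lstsq for comparison.
w_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Both approaches should agree up to floating-point error.
print(np.allclose(w_normal, w_lstsq))
```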

### Make predictions on a 30% split of the train.csv data set


In [8]:
import torch
import torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split


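# Set aside 30% of the rows for evaluation. Note that w was fitted on all of
# train_X above, so these rows were already seen during training.
X_train, X_evaluation, y_train, y_evaluation = train_test_split(train_X, train_y, test_size = 0.3, random_state = 0)

# Predict fares for the evaluation rows with the learned weights.
y_evaluation_result = np.matmul(X_evaluation, w).round(decimals = 2)

# Root mean squared error between predictions and targets.
def rmse(x, y):
    return np.sqrt(((x - y) ** 2).mean())

RMSE = rmse(y_evaluation_result, y_evaluation)

print("RMSE Value based on train data set:")
print(RMSE)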


print(os.listdir('.'))

RMSE Value based on train data set:
18.009408916465482
['input', 'submission.csv', '.ipynb_checkpoints', 'Simple Liner Model.ipynb']
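Because `w` above was fitted on every row of `train_X`, the 30% split is not unseen data. A minimal sketch of a true hold-out evaluation, on synthetic data and using only numpy, might look as follows. The data, the weights, and the names `fit_idx`, `eval_idx`, `w_fit`, and `holdout_rmse` are all invented for illustration.

```python
import numpy as np

def rmse(predictions, targets):
    # Root mean squared error between two equally shaped arrays.
    return np.sqrt(((predictions - targets) ** 2).mean())

rng = np.random.default_rng(0)

# Synthetic stand-in for train_X / train_y; values are invented.
X = np.column_stack((rng.uniform(0.0, 0.2, size=(1000, 2)), np.ones(1000)))
y = X @ np.array([100.0, 80.0, 3.0]) + rng.normal(0.0, 1.0, size=1000)

# Shuffle once, then hold out the last 30% of rows for evaluation.
order = rng.permutation(len(X))
split = int(0.7 * len(X))
fit_idx, eval_idx = order[:split], order[split:]

# Fit on the 70% only, so the evaluation rows stay unseen.
w_fit, _, _, _ = np.linalg.lstsq(X[fit_idx], y[fit_idx], rcond=None)

holdout_rmse = rmse(X[eval_idx] @ w_fit, y[eval_idx])
print(holdout_rmse)
```

With a model this small and this much data the two numbers are likely to be close, but fitting on the training part only keeps the evaluation honest once the model grows.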


### Make predictions on the test set
Now let's load up our test inputs and predict the fare_amounts for them using our learned weights!

In [None]:
test_df = pd.read_csv('./input/test.csv')
test_df.dtypes

In [None]:
add_travel_vector_features(test_df)
test_X = get_input_matrix(test_df)
# Predict fare_amount on the test set using our model (w) trained on the training set.
test_y_predictions = np.matmul(test_X, w).round(decimals = 2)

# Write the predictions to a CSV file which we can submit to the competition.
submission = pd.DataFrame(
    {'key': test_df.key, 'fare_amount': test_y_predictions},
    columns = ['key', 'fare_amount'])
submission.to_csv('submission.csv', index = False)

print(os.listdir('.'))

## RMSE: $5.74 when the submission is evaluated on the Kaggle website

### Ideas for Improvement
The output here will score an RMSE of $5.74, but you can do better than that! Here are some suggestions:

- Use more columns from the input data. Here we're only using the start/end GPS points from columns [pickup|dropoff]_[latitude|longitude]. Try to see if the other columns -- pickup_datetime and passenger_count -- can help improve your results.
- Use absolute location data rather than relative. Here we're only looking at the difference between the start and end points, but maybe the actual values -- indicating where in NYC the taxi is traveling -- would be useful.
- Use a non-linear model to capture more intricacies within the data.
- Try to find more outliers to prune, or construct useful feature crosses.
- Use the entire dataset. The starter-kernel text assumes only about 20% of the training data is used, whereas the run above loaded all 55,423,856 rows.
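As one way to act on the first suggestion, the sketch below derives hour-of-day and day-of-week columns from `pickup_datetime`. The helper `add_time_features`, the new column names, and the sample rows are hypothetical, and the timestamp format is an assumption about how `train.csv` stores the values.

```python
import pandas as pd

# Invented rows; the "YYYY-MM-DD HH:MM:SS UTC" format is an assumption.
toy_df = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"],
    "passenger_count": [1, 2],
})

def add_time_features(df):
    # Hypothetical helper: add hour-of-day and day-of-week columns in place.
    timestamps = pd.to_datetime(
        df["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
    )
    df["pickup_hour"] = timestamps.dt.hour
    df["pickup_weekday"] = timestamps.dt.dayofweek  # Monday=0 ... Sunday=6

add_time_features(toy_df)
print(toy_df[["pickup_hour", "pickup_weekday"]])
```

These columns could then be appended to the input matrix alongside the two coordinate differences, although a linear model may need them one-hot encoded to benefit.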