- pandas is a Python library for data manipulation and analysis.
- NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on them.
- Matplotlib is a plotting library for Python. Used together with NumPy, it provides an effective open-source alternative to MATLAB.
- matplotlib.pyplot is a collection of command-style functions that make Matplotlib work like MATLAB.
- Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

In [28]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy  as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Note: matplotlib 3.6+ names this style 'seaborn-v0_8'; 'seaborn' only works on older releases.
mpl.style.use('seaborn')


# Problem definition

# Preparing data

In [29]:
total_data=pd.read_csv("../../data/Processed/New York_Weather_cyclical_taxi.csv", parse_dates=['datetime'])
total_data.sample(5)

Unnamed: 0,datetime,temperature,humidity,pressure,wind_speed,wind_direction,rides,date,hour,month,day,year,hour_sin,hour_cos,day_sin,day_cos,month_sin,month_cos
7046,2015-10-21 14:00:00,13.96,62.0,1027.0,3.0,270.0,33,2015-10-21,14,10,21,2015,-0.631088,-0.775711,0.353676,0.935368,-0.866025,0.5
7111,2015-10-24 07:00:00,3.99,80.0,1028.0,3.0,48.0,43,2015-10-24,7,10,24,2015,0.942261,-0.33488,0.401488,0.915864,-0.866025,0.5
1781,2015-03-16 05:00:00,1.957667,74.0,1029.0,4.0,330.0,43,2015-03-16,5,3,16,2015,0.979084,0.203456,0.271958,0.962309,1.0,6.123234000000001e-17
5117,2015-08-02 05:00:00,21.97,82.0,1012.0,2.0,325.0,35,2015-08-02,5,8,2,2015,0.979084,0.203456,0.034422,0.999407,-0.866025,-0.5
1331,2015-02-25 11:00:00,-11.768,78.0,1007.0,3.0,276.0,52,2015-02-25,11,2,25,2015,0.136167,-0.990686,0.417194,0.908818,0.866025,0.5


For each city we have a time series in a column. We choose New York as our city and select the feature columns used for modelling.

# Feature engineering

In [30]:
def get_encode_feature(total_data):
    return total_data[['day_cos','hour_cos','hour_sin','day_sin','month_sin','month_cos','temperature']]

In [31]:
def get_unencoded_feature(total_data):
    return total_data[['month', 'day', 'hour']]
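
The `*_sin` / `*_cos` columns selected by `get_encode_feature` are cyclical encodings of the calendar fields. The notebook does not show how they were produced, so the following is only a minimal sketch: the helper `add_cyclical` and the toy frame are hypothetical, and the period is an assumption supplied by the caller. (In the sampled rows, month 3 has `month_sin` 1.0 and `month_cos` 6.12e-17, which is consistent with a period of 12 for the month; the periods used for hour and day are not stated.)

```python
import numpy as np
import pandas as pd


def add_cyclical(df, column, period):
    """Return a copy of df with <column>_sin and <column>_cos added.

    `period` is the length of one full cycle (an assumption supplied by the caller).
    """
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[column + "_sin"] = np.sin(angle)
    out[column + "_cos"] = np.cos(angle)
    return out


# Toy frame for illustration, not the taxi dataset.
toy = pd.DataFrame({"month": [3, 6, 12]})
encoded = add_cyclical(toy, "month", period=12)
```

With this encoding, December and January end up close together on the unit circle, which the raw integers 12 and 1 do not express.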

Let's split our data into training (60%), test (20%) and validation (20%) sets.

In [32]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(total_data, test_size=0.4)
data_test, data_val = train_test_split(data_test, test_size=0.5)
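
Because no seed is passed to `train_test_split`, the rows that land in each set change on every run. A minimal sketch of a reproducible version follows; it uses a toy frame in place of `total_data`, and the seed value 42 is an arbitrary assumption, not a value from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for total_data; 100 rows keep the arithmetic easy to check.
toy = pd.DataFrame({"hour": np.arange(100) % 24, "rides": np.arange(100)})

SEED = 42  # arbitrary; any fixed integer makes the split repeatable

# 60% train, 40% held out, then the held-out part is halved into test and validation.
train, held_out = train_test_split(toy, test_size=0.4, random_state=SEED)
test, val = train_test_split(held_out, test_size=0.5, random_state=SEED)

# Re-running with the same seed selects the same rows.
train_again, _ = train_test_split(toy, test_size=0.4, random_state=SEED)
```

Fixing the seed makes metrics comparable between runs, for example when switching between `LinearRegression` and `RandomForestRegressor`.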

In [33]:
X_train=get_unencoded_feature(data_train)
X_test=get_unencoded_feature(data_test)
y_train=data_train.rides
y_test=data_test.rides
threshold = 0.8
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)



X_train (5256, 3)
y_train (5256,)
X_test (1752, 3)
y_test (1752,)


# Model Training / Evaluation with unencoded features

In [34]:
X_train=get_unencoded_feature(data_train)
X_test=get_unencoded_feature(data_test)
y_train=data_train.rides
y_test=data_test.rides
threshold = 0.8
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)


X_train (5256, 3)
y_train (5256,)
X_test (1752, 3)
y_test (1752,)


In [35]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
model = LinearRegression()
#model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [36]:
print('MAE', mean_absolute_error(y_test, y_pred))
print('RMSE', np.sqrt(mean_squared_error(y_test, y_pred)))

MAE 5.23636277517
RMSE 6.28124816736
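
For reference, both metrics can be written directly with NumPy. This sketch uses made-up numbers rather than the notebook's predictions, so that each step can be checked by hand.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_hat = np.array([33.0, 36.0, 50.0, 65.0])

errors = y_true - y_hat  # [-3, 4, 0, -5]

# Mean absolute error: (3 + 4 + 0 + 5) / 4 = 3.0
mae = np.mean(np.abs(errors))

# Root mean squared error: sqrt((9 + 16 + 0 + 25) / 4) = sqrt(12.5)
rmse = np.sqrt(np.mean(errors ** 2))
```

Both values are in the units of the target (rides). RMSE is never smaller than MAE and grows faster when a few errors are large.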


In [37]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(model, '../../model/regression.pkl')

['../../model/regression.pkl']
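
A saved model can be restored with `joblib.load`. The sketch below is a round trip on toy data written to a temporary directory; the feature values and the file location are placeholders, not the notebook's dataset or its `../../model/` path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy (month, day, hour) rows and ride counts, for illustration only.
X = np.array([[1, 1, 0], [1, 1, 12], [6, 15, 8], [12, 31, 23]], dtype=float)
y = np.array([10.0, 30.0, 25.0, 15.0])
fitted = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regression.pkl")
    joblib.dump(fitted, path)      # returns a list with the written file name(s)
    restored = joblib.load(path)   # rebuilds the estimator from disk

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X), restored.predict(X))
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.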