# Travel times
## Description:
A driver uses an app to track GPS coordinates as he drives to work and back each day. The app collects the location and elevation data. Data for about 200 trips are summarized in this data set.
## Data source:
* Date: date of travel
* StartTime: when getting into the car
* DayOfWeek: the day name
* GoingTo: direction of travel
* Distance: distance travelled, in kilometers
* MaxSpeed: fastest speed recorded (all trips are on the 407 highway for some portion)
* AvgSpeed: the average speed for the entire trip
* AvgMovingSpeed: the average speed recorded only while the car is moving
* FuelEconomy: a rough estimate of fuel economy (it is inaccurate)
* TotalTime: duration of the entire trip, in minutes
* MovingTime: duration when the car was considered to be moving (i.e. not counting traffic delays, accidents, or time while the car is stationary)
* Take407All: is Yes if the 407 toll highway was taken for the entire trip. I try to avoid taking the 407, taking slower back routes to save costs. But some days I'm running late, or just lazy, and take it all the way.

## Comments
* Data shape: 205 rows and 13 columns
* Usage restrictions: None
* Contact person: Kevin Dunn
* Contact details: datasets@connectmv.com
* Added here on: 17 September 2015
* Last updated: 09 January 2012 13:21

# Task

Build a regression model to predict the total travel time.

## Extension
Many of the data points in this data set are only available _after_ the drive. But what if we want to predict how long it will take _before_ we leave?

Build a model that makes a prediction based only on data available to the driver _before_ they leave.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## 1. Read the data into a data frame using pandas.
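
A minimal sketch of this step, assuming the data set is stored as a CSV file whose header matches the column names listed above. The file name `travel-times.csv` is an assumption, and the three rows below are made-up placeholders (not real trips) so that the snippet runs on its own.

```python
import io
import pandas as pd

# Made-up placeholder rows (NOT real trips) using a subset of the columns described above.
# The GoingTo values "Work" and "Home" are assumed labels.
sample_csv = io.StringIO(
    "Date,StartTime,DayOfWeek,GoingTo,Distance,TotalTime,Take407All\n"
    "2012-01-02,08:10,Monday,Work,51.0,40.5,No\n"
    "2012-01-02,17:05,Monday,Home,51.4,45.0,No\n"
    "2012-01-03,08:30,Tuesday,Work,50.9,35.2,Yes\n"
)

# With the real file, pass its path instead, e.g. pd.read_csv("travel-times.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # shows which columns were parsed as numbers and which as text
```

Checking `df.dtypes` right after loading is useful here, because the next step depends on which columns pandas treats as numeric.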

## 2. Create X and y matrices

Take only the columns which have numeric values and make that the `X` matrix. Make the `TotalTime` column the `y` vector. Make sure that `TotalTime` *isn't* in your `X` matrix!
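
One possible way to do this is with `DataFrame.select_dtypes`. The sketch below uses a small made-up data frame as a stand-in for the real one; only the column names come from the description above.

```python
import pandas as pd

# Made-up placeholder rows standing in for the real data frame
df = pd.DataFrame({
    "DayOfWeek": ["Monday", "Tuesday", "Wednesday"],
    "Distance": [51.0, 52.5, 50.8],
    "AvgSpeed": [75.0, 68.2, 80.1],
    "TotalTime": [40.8, 46.2, 38.0],
})

# Keep the numeric columns only, then drop the target so it cannot leak into X
X = df.select_dtypes(include="number").drop(columns="TotalTime")
y = df["TotalTime"]

print(list(X.columns))
```

Note that a column which looks numeric but contains non-numeric placeholder values would be read as text and silently dropped by this selection, so it is worth comparing `X.columns` against the column list.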

## 3. Build a baseline KNN model with k=10

In [2]:
from sklearn.neighbors import KNeighborsRegressor
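
A minimal sketch of fitting the baseline. The data here is synthetic (random numbers, not the travel-time data set), so only the mechanics carry over; with the real data, `X` and `y` would be the matrices built in step 2.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: 50 "trips" with 3 numeric features (not the real data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 40 + 5 * X[:, 0] + rng.normal(scale=0.5, size=50)

# Baseline: predict by averaging the target over the 10 nearest training points
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X, y)
predictions = knn.predict(X)

print(predictions[:5])
```

KNN relies on distances between rows, so features measured on very different scales (for example kilometers versus minutes) may dominate the neighbor search unless they are rescaled first.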



## 4. Calculate the root mean squared error of the baseline.

In [3]:
from sklearn.metrics import mean_squared_error
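
The root mean squared error is the square root of `mean_squared_error`. The sketch below uses four made-up actual and predicted values to show the calculation; with the real model, these would be `y` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, chosen only to illustrate the arithmetic
y_true = np.array([40.0, 45.0, 38.0, 50.0])
y_pred = np.array([42.0, 44.0, 39.0, 46.0])

# Squared errors are 4, 1, 1 and 16, so the MSE is 22 / 4 = 5.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
```

The RMSE is in the same units as the target, so for `TotalTime` it reads as minutes. An RMSE computed on the same rows the model was fitted on will tend to be optimistic, which is what the cross-validation challenge below addresses.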



## Challenge 1: Use Cross-Validation to compare performance of different values of k.

In [4]:
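from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# function from lecture notes, updated for the sklearn.model_selection API
def kfold_cv_error(model, X, y, folds):
    """Return the mean per-fold RMSE; `folds` is a splitter such as KFold(n_splits=5)."""
    # Convert to arrays so positional indexing also works for pandas input
    X, y = np.asarray(X), np.asarray(y)
    test_error = []
    for train_index, test_index in folds.split(X):
        # Fit on the training folds, then score on the held-out fold
        model.fit(X[train_index], y[train_index])
        predict = model.predict(X[test_index])
        fold_error = np.sqrt(mean_squared_error(y[test_index], predict))
        test_error.append(fold_error)
    return np.mean(test_error)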
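
A sketch of how different values of k could be compared. To keep the snippet self-contained it uses scikit-learn's `cross_val_score` in place of the lecture-notes function, and synthetic data in place of the travel-time data set; the candidate values of k are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the real data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 40 + 5 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Reuse the same shuffled folds for every k so the comparison is fair
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_rmse = {}
for k in [1, 5, 10, 20]:
    model = KNeighborsRegressor(n_neighbors=k)
    # scikit-learn reports negated MSE per fold; flip the sign and take the root
    neg_mse = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    cv_rmse[k] = np.sqrt(-neg_mse).mean()

best_k = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("lowest cross-validated RMSE at k =", best_k)
```

Averaging the per-fold RMSE mirrors what the lecture-notes function returns, so the numbers should be comparable when both are run on the same folds.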



## Challenge 2: Transform non-numeric data

* Make the `Take407All` variable a boolean column.
* Make the `GoingTo` column a boolean column.
* Use the pandas function `get_dummies` to create dummy columns for the `DayOfWeek` column.
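
The three transformations above can be sketched as follows. The rows are made-up placeholders, and the category labels (`"Yes"`/`"No"` for `Take407All`, `"Home"`/`"Work"` for `GoingTo`) are assumptions about how the real columns are coded; check them with `df["GoingTo"].unique()` before relying on this.

```python
import pandas as pd

# Made-up placeholder rows; the category labels are assumed, not taken from the real file
df = pd.DataFrame({
    "Take407All": ["Yes", "No", "No"],
    "GoingTo": ["Home", "Work", "Home"],
    "DayOfWeek": ["Monday", "Tuesday", "Monday"],
})

# Comparing against one label turns a two-valued text column into booleans
df["Take407All"] = df["Take407All"] == "Yes"
df["GoingTo"] = df["GoingTo"] == "Home"  # True means "going home" (assumed label)

# One indicator column per weekday, replacing the original DayOfWeek column
df = pd.get_dummies(df, columns=["DayOfWeek"])

print(df.columns.tolist())
print(df)
```

After this step the formerly text columns are boolean or indicator columns, so they can be included in the `X` matrix alongside the numeric ones.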