Predict the total ride duration of taxi trips in New York City

Featuretools is a framework to perform automated feature engineering. It excels at transforming transactional and relational datasets into feature matrices for machine learning. This demo uses Featuretools to develop a prediction model for the New York City Taxi Trip Duration on Kaggle.

Normally, solving Kaggle problems is a very iterative process. Competitors look at the dataset, decide which features they can extract, and score those features with their model. They use that score to refine their feature extraction, then score the model again. Featuretools simplifies this process by letting you extract numerous features in one iteration.

Highlights

We can see that using Featuretools allows us to achieve better results. Featuretools is used in notebook 3 and notebook 4, both of which score in a higher percentile than the baseline.

Running the tutorial

Clone the repo

git clone https://github.com/Featuretools/predict-taxi-trip-duration.git

Install the requirements

Mac OS
```
brew install libomp
pip install -r requirements.txt
```
Linux
```
sudo apt-get install build-essential
pip install -r requirements.txt
```
You will also need to install graphviz for this demo. Please install graphviz according to the instructions in the Featuretools Documentation.

Download the data

You can download the data from Kaggle. After downloading, save the CSV files to a directory called data in the root of this repository.

Run the Tutorial notebook

jupyter notebook

Results

Comparing the Kaggle scores for the notebooks also shows the better results achieved. The leaderboard score for the most advanced notebook is very close to the best score.

FAQ

Q: Why remove the outliers in the train data?

Trips beyond the 99th percentile for trip length would unduly skew our numbers and results, so we remove them. This removes only 14593 of the nearly 1.5 million trips in the train dataset.

Some of the trips have an extremely high trip duration. When we check those points, some of the passengers are traveling into the Atlantic Ocean. Not only are these points outliers, they also probably don’t correspond to real travel information. By cutting out extreme values, we can train a regressor that is a better fit for most values.
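The filtering described above can be sketched with pandas as follows. This is a minimal illustration on made-up data, not the notebook's exact code; it assumes trip length is stored in a `trip_duration` column measured in seconds.

```python
import pandas as pd

# Toy stand-in for the train data (values are invented for illustration).
train = pd.DataFrame({
    "id": [f"id{i}" for i in range(1000)],
    # 999 ordinary trips plus one absurdly long trip
    "trip_duration": list(range(1, 1000)) + [3_000_000],
})

# Duration below which 99% of the trips fall.
cutoff = train["trip_duration"].quantile(0.99)

# Keep only the trips that are shorter than the cutoff.
filtered = train[train["trip_duration"] < cutoff]

removed = len(train) - len(filtered)
print(f"removed {removed} of {len(train)} trips")
```

Because the cutoff is computed from the data itself, roughly 1% of the rows are dropped regardless of how extreme the longest trips are.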

Q: Why is `dropoff_datetime` present in the train data but not in the test data?

According to the Kaggle website:

The decision was made to not remove dropoff coordinates from the dataset in order to provide an expanded set of variables to use in Kernels.

Since `dropoff_datetime` is not present in the test dataset, we removed it from the train data. It also doesn’t make sense to use it, since a taxi driver wouldn’t necessarily know how long a trip will take when picking someone up.
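A small sketch of why the column has to go, using two made-up rows: when the dropoff time is known, subtracting the pickup time reproduces the target exactly, so a model trained on it would learn nothing usable at prediction time. The column names `pickup_datetime` and `trip_duration` (in seconds) are assumptions about the dataset layout.

```python
import pandas as pd

# Two invented trips; the timestamps and durations are illustrative only.
train = pd.DataFrame({
    "id": ["id1", "id2"],
    "pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]),
    "dropoff_datetime": pd.to_datetime(["2016-03-14 17:32:30", "2016-06-12 00:54:38"]),
    "trip_duration": [455, 663],
})

# Dropoff minus pickup gives the target directly, which is a leak.
leak = (train["dropoff_datetime"] - train["pickup_datetime"]).dt.total_seconds()

# Drop the column so train and test expose the same inputs.
train_clean = train.drop(columns=["dropoff_datetime"])
```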

Q: What is `drop_contains`?

It is a list of strings which will tell DFS to drop any features which match the strings.

Q: Why is `trips.test_data` in `drop_contains`?

We don't want any features to be generated on the `test_data` column. The column is simply there to differentiate between train and test data. Putting the entity name, followed by a dot and the column name, tells DFS to drop any aggregation features of `test_data`. If we had put just `test_data` in `drop_contains`, it would have dropped both the `test_data` column and the aggregation features of `test_data`.
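The difference between the two strings can be illustrated with plain substring matching. The sketch below does not call Featuretools; the helper `apply_drop_contains` and the feature names are hypothetical, written in the style DFS uses for generated features, and only mimic the behaviour described above.

```python
# Hypothetical feature names for a trips target entity.
feature_names = [
    "test_data",                                        # the raw flag column
    "passenger_count",
    "pickup_neighborhoods.MEAN(trips.test_data)",       # aggregation of the flag
    "pickup_neighborhoods.SUM(trips.test_data)",        # aggregation of the flag
    "pickup_neighborhoods.MEAN(trips.passenger_count)",
]

def apply_drop_contains(names, drop_contains):
    """Keep only the names that contain none of the given substrings."""
    return [n for n in names if not any(s in n for s in drop_contains)]

# "trips.test_data" matches only the aggregation features.
kept_qualified = apply_drop_contains(feature_names, ["trips.test_data"])

# "test_data" also matches the raw column itself.
kept_bare = apply_drop_contains(feature_names, ["test_data"])
```

With the qualified string, the `test_data` column survives so it can still be used to split train from test after the feature matrix is built.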

Q: What is the model being used?

XGBoost, which stands for eXtreme Gradient Boosting, is the model used. It is a very popular machine learning algorithm in Kaggle competitions for structured or tabular data. More information can be found here.

Feature Labs

Featuretools is an open source project created by Feature Labs. To see the other open source projects we're working on, visit Feature Labs Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.

Contact

Any questions can be directed to help@featurelabs.com
