Investigating Medals at the Olympic Games using Featuretools
Overview of Results
We predict the medals won at various points throughout history. A baseline that uses only the average number of medals won reaches an average AUC score of 0.74. With automated feature engineering, we can generate hundreds of features and improve the score to 0.95 on average. Because the model is so accurate, we can see clear evidence of historical events that occur outside of our data.
Featuretools is a framework to perform automated feature engineering. It excels at transforming transactional and relational datasets into feature matrices for machine learning.
The notebooks here show how Featuretools:
- Simplifies data science-related code
- Enables us to ask innovative questions
- Avoids classic label-leakage problems
- Exhaustively generates hundreds of features
We do so by investigating the medals won by each country at each historical Olympic Games (dataset pulled from Kaggle). The dataset contains each medal won at each Olympic Games, including the medaling athlete, their gender, and their country and sport.
I'll generate a model using Featuretools that predicts whether or not a country will win more than 10 medals at the next Olympics. While it's possible to have some predictive accuracy without machine learning, feature engineering is necessary to improve the score.
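To make the prediction target concrete, here is a minimal sketch of how the "more than 10 medals at the next Olympics" label could be derived from per-medal records. It is plain Python, not code from the notebooks; the `medals` records, the `make_labels` helper, and the `threshold` parameter are all hypothetical names used for illustration.

```python
from collections import Counter

# Hypothetical medal records: one (year, country) tuple per medal won.
medals = (
    [(2000, "AAA")] * 12 + [(2004, "AAA")] * 11
    + [(2000, "BBB")] * 4 + [(2004, "BBB")] * 2
)


def make_labels(medals, threshold=10):
    """Map (games_year, country) to True if the country wins more than
    `threshold` medals at the Games that follow `games_year`."""
    counts = Counter(medals)  # medals per (year, country)
    years = sorted({year for year, _ in medals})
    countries = {country for _, country in medals}
    labels = {}
    # Pair each Games with the one that follows it.
    for current, following in zip(years, years[1:]):
        for country in countries:
            labels[(current, country)] = counts[(following, country)] > threshold
    return labels


labels = make_labels(medals)
print(labels[(2000, "AAA")], labels[(2000, "BBB")])
```

Each label is keyed by the Games at which the prediction is made, so the row for 2000 describes what happens in 2004. The most recent Games has no label because its successor is not in the data.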
pip install -r requirements.txt
The Olympic Games dataset is found here. Copy the three csv files into the
Detailed Description of Notebooks
Featuretools Basics: FeaturetoolsPredictiveModeling.ipynb
In this notebook, I'll explain how to use out-of-the-box methods from Featuretools to transform the raw Olympics dataset into a machine-learning-ready feature matrix. Along the way, I'll build a machine learning model and explore which features were the most predictive.
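As a rough picture of what a machine-learning-ready feature matrix looks like, the sketch below hand-rolls a few aggregation features in plain Python. It does not use the Featuretools API, which generates features like these automatically and in far greater number. The `history` data and the `features_at_cutoff` helper are made up for illustration. The point to notice is the cutoff: each row uses only Games held strictly before the prediction time, which is how label leakage is avoided.

```python
from statistics import mean

# Made-up per-Games medal totals: {country: {year: medals_won}}.
history = {
    "AAA": {1996: 30, 2000: 20, 2004: 40},
    "BBB": {1996: 4, 2000: 6, 2004: 2},
    "CCC": {2004: 7},
}


def features_at_cutoff(history, cutoff_year):
    """Build one feature row per country from Games strictly before the cutoff."""
    rows = {}
    for country, totals in history.items():
        past = [n for year, n in sorted(totals.items()) if year < cutoff_year]
        if not past:
            continue  # no history before the cutoff, so no features
        rows[country] = {
            "mean_medals": mean(past),
            "max_medals": max(past),
            "last_medals": past[-1],
            "num_games": len(past),
        }
    return rows


feature_matrix = features_at_cutoff(history, cutoff_year=2004)
for country, row in feature_matrix.items():
    print(country, row)
```

The 2004 totals never enter the rows built with `cutoff_year=2004`, so a model trained on them cannot peek at the outcome it is asked to predict.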
Baselines using Featuretools: BaselineSolution.ipynb
Machine learning performance scores should never be taken at face value. To have any merit, they must be compared against a simple baseline model to see how much improvement they produced. In this notebook, I'll construct a baseline solution leveraging Featuretools to easily build a custom feature.
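The idea of scoring a baseline can be sketched without any library: use each country's average past medal count as its score and measure how well that single number ranks the labels. The snippet below is an illustrative stand-in, not the notebook's implementation; the `auc` helper and the sample scores and labels are hypothetical. It computes AUC as the fraction of positive/negative pairs that the score orders correctly, with ties counted as half.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical baseline scores: average medal count at past Games, per country.
average_past_medals = [35.0, 12.0, 9.0, 3.0, 0.5]
# Hypothetical labels: did that country win more than 10 medals at the next Games?
won_more_than_10 = [True, False, True, False, False]

baseline_auc = auc(average_past_medals, won_more_than_10)
print(round(baseline_auc, 3))
```

A model's score is only meaningful relative to a number like this one. The pairwise loop is quadratic in the number of rows, which is fine for a sketch but slower than a rank-based implementation on large datasets.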
Featuretools is an open source project created by Feature Labs. To see the other open source projects we're working on, visit Feature Labs Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.
Any questions can be directed to firstname.lastname@example.org