This is the onboarding document for the TWDE Datalab. If you want to get involved, find something confusing, or just want to say hi, please open an issue or join the Slack channel. Please also check out the slide deck we use for our hands-on workshops.
- Introduction
- Data
- Workflow
- Getting Started Locally
- Ways To Get Involved
- Next Steps: Getting started on AWS
It's our goal to onboard you to the basics of data science as quickly and thoroughly as possible. We've selected a challenge from kaggle.com that, broadly speaking, is comparable to a realistic problem we would tackle for clients. The specific problem is demand forecasting for an Ecuadorian grocery company. For details, see the Favorita Grocery Sales Forecasting Kaggle competition.
The competition provides four years of purchasing history along with related data such as the price of oil (Ecuador is a net exporter of oil) and public holidays in Ecuador. Our goal is to analyze this data, plus any other data we acquire (see the external data discussion on Kaggle), and produce an estimate of unit sales for each item in each store for a given time period.
To make it easier to get started, we provide a data set that is a subset of the original data. The sample consists of only one type of store in one city (Quito), and only includes transaction data from the last year. This dramatically reduces the size of the data, which limits our predictive capabilities, but will be more than enough to get started with thorough analysis.
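For illustration, here is a rough sketch of how such a subset could be derived with pandas from the original Kaggle files; the file and column names (train.csv, stores.csv, city, store_nbr, date) are assumptions based on the Favorita competition data, not the code in this repository.

# Sketch only: derive a Quito-only, last-year subset from the original Kaggle
# files. File and column names are assumptions about the Favorita data set.
import pandas as pd

stores = pd.read_csv("stores.csv")
sales = pd.read_csv("train.csv", parse_dates=["date"])

# keep only stores in Quito (optionally also filter on a single store type)
quito_store_ids = stores.loc[stores["city"] == "Quito", "store_nbr"]
subset = sales[sales["store_nbr"].isin(quito_store_ids)]

# keep only the last year of transactions
cutoff = subset["date"].max() - pd.DateOffset(years=1)
subset = subset[subset["date"] >= cutoff]
subset.to_csv("sample_train.csv", index=False)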
We have structured our workflow into five steps: merging, splitting, training, predicting, and validating.
Each step has a correspondingly named .py file, except for training and predicting, which are combined in an algorithm-specific file such as decision_tree.py.
We provide two functioning machine learning models: a simple decision tree and a time series forecasting model. Check out the description for our decision tree pipeline for details about our implementation.
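To give a feel for the workflow, here is a self-contained toy version of the five steps on synthetic data. The real pipeline operates on the Favorita CSVs, and the error metric below is only an illustration of the kind of score reported at the end.

# Toy end-to-end walk through the five steps (merge, split, train, predict,
# validate) on synthetic data. The real implementation lives in the
# correspondingly named .py files and decision_tree.py.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# "merging": pretend we already joined the raw files into one big table
rng = np.random.default_rng(42)
big_table = pd.DataFrame({
    "store_nbr": rng.integers(1, 5, 1000),
    "item_nbr": rng.integers(1, 50, 1000),
    "day_of_week": rng.integers(0, 7, 1000),
    "unit_sales": rng.poisson(5, 1000).astype(float),
})

# "splitting": hold out the last 20% of rows for validation
split_at = int(len(big_table) * 0.8)
train, validation = big_table.iloc[:split_at], big_table.iloc[split_at:]

# "training" and "predicting"
features = ["store_nbr", "item_nbr", "day_of_week"]
model = DecisionTreeRegressor().fit(train[features], train["unit_sales"])
predictions = model.predict(validation[features])

# "validating": root mean squared logarithmic error (lower is better)
rmsle = np.sqrt(np.mean((np.log1p(predictions) - np.log1p(validation["unit_sales"])) ** 2))
print(f"validation error: {rmsle:.2f}")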
This project requires Python 3. The easiest way to get Python 3 is by using Anaconda.
git clone https://github.com/ThoughtWorksInc/twde-datalab && cd twde-datalab
pip install -r requirements.txt
sh run_decisiontree_pipeline.sh
After running the pipeline, which can take a while the first time as it downloads the reduced raw data from our public S3 bucket (s3://twde-datalab/raw), the output data will be stored in folders corresponding to the file that created them, e.g.
data/merger/bigTable.csv
data/splitter/train.csv
data/splitter/validation.csv
data/decision_tree/model.pkl
data/decision_tree/score_and_metadata.csv
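Once those files exist, you can load and inspect them directly. A quick sketch (exactly what the CSVs contain is determined by the pipeline code, so treat this as a starting point):

# Peek at the pipeline outputs listed above.
import pickle
import pandas as pd

train = pd.read_csv("data/splitter/train.csv")
score = pd.read_csv("data/decision_tree/score_and_metadata.csv")

with open("data/decision_tree/model.pkl", "rb") as f:
    model = pickle.load(f)

print(train.shape)
print(score)
print(model)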
If the decision tree pipeline ran without error, you should see a score (error rate) of ~0.83 printed; the lower the score, the better the prediction. You are now ready to start science-ing on your own! Next, you can consider:
- Reading about how (and why) we implement the pipeline for the decision tree the way we do
- Running the tests: just run pytest in the project's root directory
- Doing some exploratory analysis and documenting what you find (see the sketch after this list)
- Coming up with a hypothesis about a feature engineering task and testing it
- See also "Ways To Get Involved" below, or search our issues for more ideas!
There is plenty of low-hanging fruit ready to be picked by you, dear reader, if you want to get involved in the Data Science world at ThoughtWorks. Look at the issues on this repository for specifics or to ask for guidance. Categorically, some of the possible next steps include:
- Use more features for existing algorithms
- Daily weather
- Price of oil
- Natural disasters
- Political unrest
- Tune the hyperparameters of the existing machine learning algorithms (see the sketch after this list)
- Try Different Models
- Random forest
- Neural network
- Time series regression for each item
- Improve Validation Strategy
- Use 30% of the data to validate
- Don't only validate on the last time period of the training data
- Improve the pipeline setup
- Can data preprocessing be used for multiple algorithms?
- Streamline deployment on AWS
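For the first two ideas, a hedged sketch with scikit-learn might look like this (the feature and target columns are placeholders, not the ones decision_tree.py actually uses):

# Sketch: tune the decision tree's hyperparameters and try a random forest.
# Feature/target column names below are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

train = pd.read_csv("data/splitter/train.csv")
features = ["store_nbr", "item_nbr", "day_of_week"]   # placeholder feature set
X, y = train[features], train["unit_sales"]

# Hyperparameter search for the existing decision tree
search = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={"max_depth": [5, 10, 20, None], "min_samples_leaf": [1, 10, 100]},
    cv=3,
)
search.fit(X, y)
print("best decision tree parameters:", search.best_params_)

# A different model: random forest
forest = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)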
The maintainers of the repository (Emma, Jin, Arif) will be happy to help you get started. You are also welcome to join the discussion in our Slack channel at https://tw-datalab.slack.com (simply use your TW email address to sign up).
Let's get started!
The default master branch represents a simplified version of our work that is optimised for comprehensibility and for local development, which is also why it uses a dramatically downsized dataset. To really validate whether an improvement or alternative approach significantly increases prediction quality, you probably want to run the training on the entire dataset, and you probably don't want to run it locally.
That is what the branch https://github.com/ThoughtWorksInc/twde-datalab/tree/run-on-aws is all about (a.k.a. the 'pro' branch).
IMPORTANT: The software in the Git repository does not contain AWS credentials or any other way to access an AWS account, so please make sure you have access to an AWS account. If you want to use the AWS account of the TWDE Datalab, reach out to the maintainers.
We have been exploring different ways to deploy the code on AWS. Our first approach was to create Elastic MapReduce clusters, but since we settled on pandas instead of Spark, we haven't been doing much distributed computing. Instead, there are two main ways we use AWS resources: AWS Data Pipeline and Jupyter on EC2. We use the former to run our decision tree model on larger data sets and the latter (Jupyter on EC2) to run the Prophet time series model.
If you haven't done so, install the AWS command line tools. If you are doing this now, please don't forget to configure your credentials, too.
pip install awscli
aws configure
(this will ask you for your credentials and store them in ~/.aws)
Now run the deployment script from the deployment directory:
cd deployment
./deploy-pipeline.sh -j all -n {name for the pipeline goes here}
This script will do the following:
- create a shell script based on run_pipeline.sh
- upload the shell script to S3
- create an AWS data pipeline following pipeline-definition.json
- start the pipeline
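For orientation, two of those steps could also be done directly against the AWS APIs. Here is a minimal boto3 sketch with a placeholder bucket name and pipeline id; deploy-pipeline.sh remains the authoritative implementation.

# Minimal boto3 sketch of uploading the generated shell script to S3 and
# activating an already-defined data pipeline. Bucket name and pipeline id
# are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file("run_pipeline.sh", "my-datalab-bucket", "scripts/run_pipeline.sh")

datapipeline = boto3.client("datapipeline")
datapipeline.activate_pipeline(pipelineId="df-EXAMPLE1234567890")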
The output (and logs) are available via the AWS console. Unfortunately, we've run into some issues with large file sizes, which are documented in issue #25.
Another, maybe even simpler, way to exploit cloud computing is to install Anaconda on an AWS EC2 instance and set up Jupyter Notebooks on AWS.
For running our Prophet time series model, we published a ready-to-go AMI image, tw_datalab_prophet_forecast_favorita, that already includes the relevant Jupyter notebooks.
Just search for this image in 'Community AMIs' when launching an EC2 machine and make sure you open port 8888.
Then ssh into your machine and start the Jupyter server:
jupyter notebook --no-browser --port=8888
Afterwards you should be able to open Jupyter in your browser at https://ec2-{public-ip-of-ec2-machine}.{my-region}.compute.amazonaws.com:8888. When asked for a password, simply type 'datalab'.
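If you prefer to experiment outside the provided notebooks, a minimal Prophet example looks roughly like this; the aggregation to a ds/y frame is an illustrative assumption, and the AMI's notebooks contain the actual per-item forecasting setup.

# Minimal Prophet sketch: forecast total daily unit sales 30 days ahead.
# The aggregation below is illustrative; column names are assumptions about
# data/merger/bigTable.csv.
import pandas as pd
from fbprophet import Prophet

big_table = pd.read_csv("data/merger/bigTable.csv", parse_dates=["date"])
daily = (big_table.groupby("date")["unit_sales"].sum()
         .reset_index()
         .rename(columns={"date": "ds", "unit_sales": "y"}))

model = Prophet()
model.fit(daily)

future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())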