Team Data Science Process Lifecycle
The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data science projects. The lifecycle defines the steps that projects executed with TDSP follow from start to finish. If you use another lifecycle, such as CRISP-DM, KDD, or a custom process that works well in your organization, you can still apply TDSP in the context of that development lifecycle. Note that this lifecycle is designed for data science projects that lead to data products and intelligent applications: projects whose predictive analytics, machine learning, or artificial intelligence (AI) models are productionized. Exploratory data science projects and ad hoc or one-off analytics projects can also use this process, but some steps of the lifecycle may not be needed.
Here is a depiction of the TDSP lifecycle.
The TDSP data science lifecycle is composed of five major stages that are executed iteratively:
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment
- Customer Acceptance
We describe each stage in detail.
1. Business Understanding
The goals of this stage are:
- Clearly and explicitly specify the model target(s) as 'sharp' question(s) that drive the customer engagement.
- Clearly specify where to find the data sources of interest, and determine whether ancillary data needs to be brought in from other sources.
How to do it
In this stage, you work with your customer and stakeholders to understand the business problems that can be greatly enhanced with predictive analytics. A central objective of this step is to identify the key business variables (for example, a sales forecast or the probability of an order being fraudulent) that the analysis needs to predict (also known as model targets) to satisfy these requirements. In this stage you also develop an understanding of the data sources needed to address the objectives of the project from an analytical perspective. There are two main aspects of this stage: defining objectives and identifying data sources.
1.1 Define Objectives
- Understand the customer's business domain and the key variables by which success is defined in that space. Then identify the business problems that data science should address in order to move those key metrics.
- Define the project goals with 'sharp' question(s). A good description of what a sharp question is, and how to ask one, can be found in this article. As the article puts it, a useful tip for asking a sharp question is: "When choosing your question, imagine that you are approaching an oracle that can tell you anything in the universe, as long as the answer is a number or a name". Data science and machine learning are typically used to answer these five types of questions:
- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is this weird? (anomaly detection)
- Which option should be taken? (recommendation)
- Define the project team and its roles and responsibilities. Develop a high-level milestone plan that you iterate on as more information is discovered.
- Define success metrics. The metrics must be SMART (Specific, Measurable, Achievable, Relevant, and Time-bound). For example: achieve a customer churn prediction accuracy of X% by the end of this three-month project so that we can offer promotions to reduce churn.
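The five question types above can be made concrete with a small illustrative mapping from question type to a common algorithm family. The algorithm names here are typical examples, not prescriptions:

```python
# Illustrative mapping from sharp-question type to machine learning task.
# The example algorithms are common choices, not prescriptions.
QUESTION_TO_TASK = {
    "How much or how many?": ("regression", "linear regression"),
    "Which category?": ("classification", "logistic regression"),
    "Which group?": ("clustering", "k-means"),
    "Is this weird?": ("anomaly detection", "isolation forest"),
    "Which option should be taken?": ("recommendation", "matrix factorization"),
}

def task_for(question: str) -> str:
    """Return the machine learning task matching a sharp-question type."""
    return QUESTION_TO_TASK[question][0]
```

Framing the customer's question in one of these forms early makes the choice of modeling approach in the later stages much more mechanical.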
1.2 Identify Data Sources
Identify data sources that contain known examples of answers to the sharp questions. Look for the following:
- Data that is Relevant to the question. Do we have measures of the target and features that are related to the target?
- Data that is an Accurate measure of our model target and the features of interest.
It is not uncommon, for example, to find that existing systems need to collect and log additional kinds of data to address the problem and achieve the project goals. In this case, you may want to look for external data sources or update your systems to collect newer data.
The following are the deliverables in this stage.
- Charter Document: A standard template is provided in the TDSP project structure definition. This is a living document that is updated throughout the project as new discoveries are made and as business requirements change. The key is to iterate on this document with finer detail as you progress through the discovery process. Be sure to keep the customer and stakeholders involved in the changes and to clearly communicate the reasons for each change.
- Data Sources: This is part of the Data Report that is found in the TDSP project structure. It describes the sources for the raw data. In later stages you will fill in additional details, such as the scripts to move the data to your analytic environment.
- Data Dictionaries: This document provides the descriptions and the schema (data types, information on any validation rules) of the data that will be used to answer the question. If available, entity-relationship diagrams or descriptions are included too.
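A data dictionary can also be captured in a simple machine-readable form so that validation rules are enforceable in code. A minimal sketch, where the column names and rules are hypothetical examples:

```python
# Minimal sketch of a machine-readable data dictionary.
# Column names, types, and validation rules are hypothetical examples.
data_dictionary = {
    "customer_id": {"type": "string", "nullable": False,
                    "description": "Unique customer identifier"},
    "order_total": {"type": "float", "nullable": False,
                    "description": "Order amount in USD",
                    "validation": lambda v: v >= 0},
    "churned":     {"type": "bool", "nullable": True,
                    "description": "Model target: did the customer churn?"},
}

def validate_row(row: dict) -> list:
    """Return the column names in `row` that violate the data dictionary."""
    errors = []
    for col, spec in data_dictionary.items():
        value = row.get(col)
        if value is None:
            if not spec["nullable"]:
                errors.append(col)  # required column is missing
            continue
        rule = spec.get("validation")
        if rule is not None and not rule(value):
            errors.append(col)      # value fails the validation rule
    return errors
```

Keeping the dictionary executable like this lets the same document serve both as project documentation and as an ingestion-time check.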
2. Data Acquisition and Understanding
The goals for this stage are:
- Ingest the data into the target analytic environment.
- Determine whether the available data can be used to answer the question.
How to do it
In this stage, you will start developing the process to move the data from the source location to the target locations where the analytics operations like training and predictions (also known as scoring) will be run. For technical details and options on how to do this on various Azure data services, see Load data into storage environments for analytics.
Before you train your models, you need to develop a deep understanding of the data. Real-world data is often messy, with incomplete or incorrect values. By summarizing and visualizing the data, you can quickly assess its quality and decide how to deal with any quality issues. For guidance on cleaning the data, see this article.
Data visualization can be particularly useful for answering questions such as: Have we measured the features consistently enough for them to be useful, or are there many missing values in the data? Has the data been collected consistently over the time period of interest, or are there blocks of missing observations? If the data does not pass this quality check, you may need to go back to the previous step to correct the data or obtain more.
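The missing-value scan described above can be sketched in plain Python before any visualization tooling is involved. The record and column names here are hypothetical:

```python
def missing_value_report(records, columns):
    """Fraction of missing (None) values per column across the records."""
    report = {}
    for col in columns:
        missing = sum(1 for r in records if r.get(col) is None)
        report[col] = missing / len(records)
    return report

# Hypothetical raw records with gaps in the `age` feature.
records = [
    {"age": 34,   "spend": 120.0},
    {"age": None, "spend": 80.0},
    {"age": 29,   "spend": None},
    {"age": None, "spend": 45.0},
]
print(missing_value_report(records, ["age", "spend"]))
# {'age': 0.5, 'spend': 0.25}
```

A high missing fraction for a key feature is exactly the kind of finding that sends the project back to the previous step for more or better data.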
Otherwise, you can start to explore the inherent patterns in the data that will help you develop a sound predictive model for your target. Specifically, look for evidence of how well connected the data is to the target and whether there is enough data to move forward. In making that determination, you may need to find new data sources with more accurate or more relevant data to complete the data set initially identified in the previous stage. TDSP also provides an automated utility, called IDEAR, to help visualize the data and prepare data summary reports. We recommend starting with IDEAR to explore the data and develop an initial understanding interactively with no coding, and then writing custom code for further data exploration and visualization.
In addition to the initial ingestion of data, you will typically need to set up a process to score new data or refresh the data regularly as part of an ongoing learning process. This can be done by setting up a data pipeline or workflow. Here is an example of how to set up a pipeline with Azure Data Factory. A solution architecture for the data pipeline is developed in this stage, and the pipeline itself is developed in parallel during the following stages of the data science project. The pipeline may be batch-based, streaming/real-time, or a hybrid, depending on your business needs and the constraints of the existing systems into which the solution is being integrated.
The following are the deliverables in this stage.
- Data Quality Report: This report contains data summaries, the relationship between each attribute and the target, variable ranking, and so on. The IDEAR tool provided as part of TDSP can quickly generate this report on any tabular dataset, such as a CSV file or a relational table.
- Solution Architecture: This can be a diagram and/or a description of the data pipeline used to run scoring or predictions on new data once you have built a model. It also contains the pipeline used to retrain your model on new data. The document is stored in this directory when using the TDSP directory structure template.
- Checkpoint Decision: Before beginning the full feature engineering and model building process, reevaluate the project to determine whether the expected value justifies continuing the effort. You may be ready to proceed, you may need to collect more data, or the data needed to answer the question may not exist.
3. Modeling
The goals for this stage are:
- Develop new attributes or data features (a process known as feature engineering) for building the machine learning model.
- Construct and evaluate an informative model that predicts the target.
- Determine whether the model is suitable for production use.
How to do it
There are two main aspects of this stage: feature engineering and model training. They are described in the following subsections.
3.1 Feature Engineering
Feature engineering involves the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis. If you want insight into what is driving the model, you need to understand how the features relate to each other and how the machine learning method will use them. This step requires a creative combination of domain expertise and the insights obtained from the data exploration step. It is a balancing act: include informative variables without including too many unrelated ones. Informative variables improve the result; unrelated variables introduce unnecessary noise into the model. You also need to be able to generate these features for new data during scoring, so feature generation must not depend on any data that is unavailable at scoring time. For technical guidance on feature engineering with various Azure data technologies, see this article.
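As a sketch of the aggregation aspect, raw transaction rows can be rolled up into per-customer features that use only information available at scoring time. All field names here are illustrative:

```python
from collections import defaultdict

def customer_features(transactions):
    """Aggregate raw transaction rows into per-customer features.

    Uses only fields available at scoring time (no target-derived inputs),
    so the same function can run during both training and scoring.
    """
    amounts_by_customer = defaultdict(list)
    for t in transactions:  # each t = {"customer": ..., "amount": ...}
        amounts_by_customer[t["customer"]].append(t["amount"])
    return {
        cust: {
            "n_orders": len(amounts),
            "total_spend": sum(amounts),
            "avg_order": sum(amounts) / len(amounts),
        }
        for cust, amounts in amounts_by_customer.items()
    }

# Hypothetical raw transactions.
transactions = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]
features = customer_features(transactions)
```

Centralizing feature generation in one function like this is what makes the "no dependency on data unavailable at scoring time" requirement checkable.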
3.2 Model Training
Depending on the type of question you are trying to answer, there are multiple modeling algorithms available. For guidance on choosing among them, see this article. NOTE: Although that article is written for Azure Machine Learning, its guidance is generally useful for other frameworks as well.
The process for model training is:
- The input data for modeling is usually split randomly into a training data set and a test data set.
- The models are built using the training data set.
- Evaluate a series of competing machine learning algorithms, along with their associated tuning parameters (also known as a parameter sweep), on both the training and test data sets, geared toward answering the question of interest with the data currently at hand.
- Determine the "best" solution to answer the question by comparing the success metric across the alternative methods.
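The split-train-compare loop above can be sketched schematically. To keep the example dependency-free, the candidate "models" here are stand-in scoring functions rather than real learners, and the data is a toy labelled set:

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Randomly split labelled examples into train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, data):
    """Fraction of (x, label) examples the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

# Toy labelled data: the label is 1 exactly when x > 5.
data = [(x, int(x > 5)) for x in range(20)]
train, test = train_test_split(data)

# Competing candidate models (stand-ins for learners fit on `train`).
candidates = {
    "threshold_5": lambda x: int(x > 5),
    "always_zero": lambda x: 0,
}
scores = {name: accuracy(m, test) for name, m in candidates.items()}
best = max(scores, key=scores.get)  # compare the success metric
```

In a real project the candidates would be trained estimators and the success metric would be the one agreed in the charter, but the comparison structure is the same.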
[NOTE] Avoid leakage: Leakage is caused by including variables that can perfectly predict the target. These are often variables that were used to detect the target in the first place; as the target is redefined, such dependencies can be hidden from the original definition. Avoiding leakage often requires iterating between building an analysis data set, creating a model, and evaluating its accuracy. Leakage is a major reason data scientists get nervous when they get very good predictive results.
We provide an Automated Modeling and Reporting tool with TDSP that can run through multiple algorithms and parameter sweeps to produce a baseline model. It also produces a baseline modeling report summarizing the performance of each model and parameter combination, including variable importance. This output can in turn drive further feature engineering.
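The parameter sweep such a tool automates can be sketched with `itertools.product` over a hypothetical grid; `evaluate` below stands in for a real train-and-score step:

```python
from itertools import product

# Hypothetical hyperparameter grid for a sweep.
grid = {
    "depth": [2, 4, 8],
    "learning_rate": [0.1, 0.01],
}

def evaluate(params):
    """Stand-in for training a model and scoring it on held-out data."""
    # Toy score: shallow models with the larger learning rate do best here.
    return 1.0 / params["depth"] + params["learning_rate"]

def sweep(grid, evaluate):
    """Evaluate every parameter combination; return (best_params, results)."""
    names = list(grid)
    results = []
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, _ = max(results, key=lambda pr: pr[1])
    return best_params, results

best_params, results = sweep(grid, evaluate)
```

The `results` list, with one score per parameter combination, is exactly the raw material the baseline modeling report summarizes.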
The artifacts produced in this stage include:
- Feature Sets: The features developed for modeling are described in the Feature Set section of the Data Definition report. It contains pointers to the code used to generate the features and a description of how each feature was generated.
- Modeling Report: For each model that is tried, a standard report following a specified TDSP template is produced.
- Checkpoint Decision: Here we evaluate whether the model performs well enough to be deployed to a production system. Some of the questions to ask include:
- Does the model answer the question sufficiently given the test data?
- Should we go back and collect more data or do more feature engineering or try other algorithms?
4. Deployment
The goal for this stage is:
- Deploy the models and pipeline to a production or production-like environment for final user acceptance.
How to do it
Once you have a set of models that perform well, they can be operationalized for other applications to consume. Depending on the business requirements, predictions are made either in real time or on a batch basis. To be operationalized, the models are exposed through an open API interface that can be easily consumed from various applications, such as online websites, spreadsheets, dashboards, or line-of-business and back-end applications. See an example of model operationalization with an Azure Machine Learning web service in this article. It is also a good idea to build telemetry and monitoring into the production model deployment and the data pipeline to help with system status reporting and troubleshooting.
The following are the deliverables in this stage.
- Status dashboard of system health and key metrics
- Final modeling report with deployment details
- Final solution architecture document
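The open API interface described above can be sketched as a minimal JSON-in / JSON-out scoring handler. The model logic and field names are hypothetical, and a real deployment would wrap this in a web framework or a managed service such as an Azure Machine Learning web service:

```python
import json

def score(features):
    """Hypothetical trained model: flags orders over a spend threshold."""
    high_risk = features["order_total"] > 500
    return {"fraud_probability": 0.9 if high_risk else 0.1}

def handle_request(body: str) -> str:
    """JSON-in / JSON-out scoring handler, as a web endpoint would expose.

    Any application that can send JSON (a website, spreadsheet add-in,
    dashboard, or back-end service) can consume this interface.
    """
    features = json.loads(body)
    return json.dumps(score(features))

response = handle_request('{"order_total": 700}')
```

Keeping the handler this thin also makes it a natural place to emit the telemetry mentioned above, since every request and response passes through one function.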
5. Customer Acceptance
The goal of this stage is to finalize the project deliverables by confirming that the pipeline, the model, and their deployment in a production environment satisfy the customer's objectives.
How to do it
The customer validates that the system meets their business needs and answers the questions with acceptable accuracy before the system is deployed to production for use by their client applications. All the documentation is finalized and reviewed, and the project is handed off to the entity responsible for operations. This could be an IT or data science team at the customer, or an agent of the customer, that will be responsible for running the system in production.
The main artifact produced in this final stage is the Project Final Report: the project technical report containing all details of the project that are useful for learning about and operating the system. TDSP provides a template that can be used as-is or customized for specific client needs.
We have seen that the Team Data Science Process lifecycle is modeled as a sequence of iterated steps that guide the tasks needed to build predictive models that can be deployed in a production environment and leveraged in intelligent applications. The goal of this lifecycle is to keep a data science project moving toward a clear engagement end point. While data science is an exercise in research and discovery, communicating that work to customers through a well-defined set of artifacts in standardized templates helps avoid misunderstandings and increases the odds of success.