# How to write production level code in data science

1. Keep it modular
This is a software design technique recommended for any software engineer. The idea is to break a large program into small, independent sections (functions) based on functionality. There are two parts to it.

(i) Break the code into smaller pieces each intended to perform a specific task (may include sub tasks)

(ii) Group these functions into modules (Python files) based on how they are used. This also helps you stay organized and makes the code easier to maintain.

The first step is to decompose a large program into many simple functions with specific inputs (and input formats) and outputs (and output formats). As mentioned earlier, each function should perform a single task, such as cleaning up outliers in the data, replacing erroneous values, scoring a model, or calculating root-mean-squared error (RMSE). Try to break each of those functions down further into sub-tasks, and continue until none of the functions can be broken down any further.

Low-level functions: the most basic functions, which cannot be decomposed further. For example, computing the RMSE or the Z-score of the data. Some of these functions can be widely used for training and implementing any algorithm or machine learning model.

Medium-level functions: functions that use one or more low-level functions and/or other medium-level functions to perform their task. For instance, a clean-up-outliers function uses the compute-Z-score function to remove outliers by retaining only the data within certain bounds, and an error function uses the compute-RMSE function to get RMSE values.

High-level functions: functions that use one or more medium-level and/or low-level functions to perform their task. For example, a model training function that uses several functions, such as a function to get randomly sampled data, a model scoring function, a metric function, and so on.

The final step is to group all the low-level and medium-level functions that are useful for more than one algorithm into one Python file (which can be imported as a module), and all other low-level and medium-level functions that are useful only for the algorithm in question into another Python file. All the high-level functions should reside in a separate Python file. This file dictates each step in the algorithm development, from combining data from different sources to the final machine learning model.
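As a rough illustration of the three levels, here is a minimal sketch using only the Python standard library. The function names (`compute_rmse`, `compute_zscores`, `remove_outliers`, `evaluate_baseline`) and the mean-only "model" are assumptions made for this example, not part of any particular library. In a real project the low- and medium-level functions would live in a shared module and the high-level function in its own file.

```python
import math
from statistics import mean, pstdev


def compute_rmse(actual, predicted):
    """Low-level: root-mean-squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared_errors))


def compute_zscores(values):
    """Low-level: Z-score of each value, using the population standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]


def remove_outliers(values, max_abs_zscore=3.0):
    """Medium-level: keep only the values whose |Z-score| is within the bound."""
    zscores = compute_zscores(values)
    return [v for v, z in zip(values, zscores) if abs(z) <= max_abs_zscore]


def evaluate_baseline(values, max_abs_zscore=3.0):
    """High-level: clean the data, then score a hypothetical mean-only baseline."""
    cleaned = remove_outliers(values, max_abs_zscore)
    baseline_prediction = mean(cleaned)
    return compute_rmse(cleaned, [baseline_prediction] * len(cleaned))
```

Each function can be tested and reused on its own, and the high-level function reads like a summary of the steps it performs.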

There is no hard-and-fast rule that says you must follow the above steps, but I strongly suggest you start with them and develop your own style thereafter.

2. Logging and Instrumentation
Logging and Instrumentation (LI) are analogous to the black box in an aircraft, which records everything that happens in the cockpit. The main purpose of LI is to record useful information from the code during its execution, mainly to help the programmer debug if anything goes awry, and also to improve the performance of the code (for example, by reducing execution times).

What is the difference between Logging and Instrumentation?

(i) Logging: records only actionable information, such as critical failures at run time, and structured data, such as intermediate results that will later be used by the code itself. Multiple log levels such as debug, info, warn, and error are acceptable during the development and testing phases. However, avoid them at all costs in production, where only messages that need attention should be logged.

Logging should be minimal containing only information that requires human attention and immediate handling.

(ii) Instrumentation: records all the other information left out of logging that helps us validate the code’s execution steps and work on performance improvements if necessary. Here it is always better to have more data, so instrument as much information as possible.

To validate code execution steps: we should record information such as the task name, intermediate results, the steps that were executed, and so on. This helps us validate the results and confirm that the algorithm has followed the intended steps. Invalid results or a strangely performing algorithm may not raise a critical error that would be caught in logging, so it is imperative to record this information.

To improve performance: we should record the time taken by each task/subtask and the memory used by each variable. This helps us make the changes needed for the code to run faster and consume less memory (or to identify memory leaks, which can occur in Python too).

Instrumentation should record all other information left out in logging that would help us to validate code execution steps and work on performance improvements. It is better to have more data than less.
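One possible way to separate the two concerns is sketched below with the standard library only. The logger name `pipeline`, the `instrumented` decorator, and the in-memory `instrumentation` list are all assumptions made for this illustration. A real system would more likely ship these records to a metrics store. Logging is reserved for failures that need attention, while every call, successful or not, leaves a timing and memory record.

```python
import functools
import logging
import time
import tracemalloc

logger = logging.getLogger("pipeline")  # hypothetical logger name
instrumentation = []                    # hypothetical in-memory record store


def instrumented(task_name):
    """Decorator that records elapsed time and peak traced memory per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only start/stop tracing here if nobody else is already tracing,
            # so that nested instrumented calls do not stop each other.
            started_here = not tracemalloc.is_tracing()
            if started_here:
                tracemalloc.start()
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Logging: actionable information only (the failure itself).
                logger.exception("Task %s failed", task_name)
                raise
            finally:
                # Instrumentation: always recorded, even when the task fails.
                elapsed = time.perf_counter() - start
                _, peak_bytes = tracemalloc.get_traced_memory()
                if started_here:
                    tracemalloc.stop()
                instrumentation.append(
                    {"task": task_name, "seconds": elapsed, "peak_bytes": peak_bytes}
                )
        return wrapper
    return decorator
```

Decorating a function with `@instrumented("clean_data")` would then append one record per call to `instrumentation`, which can be inspected later to check that the expected steps ran and how long each took.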

3. Code Optimization

Code optimization implies both reduced time complexity (run time) and reduced space complexity (memory usage). Time/space complexity is commonly denoted as O(x), also known as Big-O notation, where x is the dominant term in the polynomial for the time or space taken. Time and space complexity are the metrics for measuring algorithm efficiency.

For example, let’s say we have a nested for loop, with n iterations at each level, whose body takes about 2 seconds per run, followed by a simple for loop of n iterations whose body takes 4 seconds per run. Then the equation for time consumption can be written as

Time taken ~ 2 n²+4 n = O(n²+n) = O(n²)

For Big-O notation, we drop the non-dominant terms (as they become negligible as n tends to infinity) as well as the coefficients. The coefficients, or scaling factors, are ignored because we have less control over them in terms of optimization flexibility. Please note that the coefficients in the absolute time taken refer to the product of the number of for loops and the time taken for each run, whereas the coefficients in O(n²+n) represent the number of for loops (1 double for loop and 1 single for loop). Dropping the lower-order term as well, the Big-O for the above process is O(n²).

Now, our goal is to replace the least efficient part of the code with a better alternative that has lower time complexity. For example, O(n) is better than O(n²). The most common killers in the code are for loops; less common, but worse than for loops, are recursive functions that branch into several calls (O(branch^depth)). Try to replace as many for loops as possible with Python modules or functions that are heavily optimized, often with C code rather than Python performing the computation, to achieve a shorter run time.
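To make the idea of swapping in a lower-complexity alternative concrete, here is a small sketch. The task (detecting duplicates) and both function names are chosen purely for illustration. The first version uses a nested for loop and is O(n²). The second replaces the inner loop with a set lookup and is O(n) on average, at the cost of O(n) extra memory and the requirement that the items be hashable.

```python
def has_duplicates_quadratic(items):
    """O(n^2) time, O(1) extra space: compare every pair with a nested loop."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """O(n) average time, O(n) extra space: a set lookup replaces the inner loop."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

The example also shows that time and space complexity often trade off against each other: the faster version pays for its speed with additional memory.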

I highly recommend reading the section on “Big-O” in Cracking the Coding Interview by Gayle Laakmann McDowell. In fact, try to read the entire book to improve your coding skills.

4. Unit Testing

Unit testing — automates code testing in terms of functionality

Your code has to clear multiple stages of testing and debugging before getting into production. Usually there are three levels: development, staging, and production. In some companies, there is a level before production that mimics the exact environment of the production system. By the time it reaches production, the code should be free from any obvious issues and should be able to handle potential exceptions.

To identify the different issues that may arise, we need to test our code against different scenarios, different data sets, different edge and corner cases, and so on. It is inefficient to carry out this process manually every time we want to test the code, which would be every time we make a major change to it. Hence, opt for unit testing, which consists of a set of test cases that can be executed whenever we want to test the code.

We have to add different test cases with expected results to test our code. The unit testing module goes through each test case, one by one, and compares the output of the code with the expected value. If the expected results are not achieved, the test fails, which is an early indicator that your code would fail if deployed into production. We then need to debug the code and repeat the process until all test cases pass.

To make our lives easier, Python has a module called unittest for implementing unit tests.
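A minimal sketch of what such a test suite might look like is shown below. The function under test, `compute_rmse`, and the helper `run_tests` are assumptions written for this example. Only `unittest` itself comes from the standard library. The test cases cover a normal scenario, a known value, and two edge cases.

```python
import math
import unittest


def compute_rmse(actual, predicted):
    """Root-mean-squared error (hypothetical function under test)."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_error_sum = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error_sum / len(actual))


class TestComputeRmse(unittest.TestCase):
    def test_perfect_prediction_gives_zero(self):
        self.assertEqual(compute_rmse([1, 2, 3], [1, 2, 3]), 0.0)

    def test_known_value(self):
        # Errors are 3 and 4, so RMSE = sqrt((9 + 16) / 2).
        self.assertAlmostEqual(compute_rmse([0, 0], [3, 4]), math.sqrt(12.5))

    def test_length_mismatch_raises(self):
        with self.assertRaises(ValueError):
            compute_rmse([1, 2], [1])

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            compute_rmse([], [])


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestComputeRmse)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In day-to-day use the tests would normally sit in their own file and be run from the command line with `python -m unittest`. The `run_tests` helper is only there to show that the suite can also be triggered from code.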

5. Compatibility with ecosystem

Most likely, your code is not going to be a standalone function or module. It will be integrated into the company’s code ecosystem, and it has to run in sync with the other parts of that ecosystem without any flaws or failures.

For instance, let’s say that you have developed an algorithm to give recommendations. The process flow usually consists of getting recent data from the database, updating or generating recommendations, and storing them in a database that is read by front-end frameworks such as webpages (using APIs) to display the recommended items to the user. Simple! It is like a chain: the new chain link should lock in with the previous and the next chain links, otherwise the process fails. Similarly, each process has to run as expected.

Each process will have well-defined input and output requirements, an expected response time, and more. If and when other modules request updated recommendations (for example, from the webpage), your code should return the expected values in the desired format within an acceptable time. Unexpected values (suggesting milk when we are shopping for electronics), an undesired format (suggestions as text rather than pictures), or an unacceptable response time (no one waits minutes for recommendations, at least these days) all imply that the code is not in sync with the system.
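Such requirements can be written down as an executable check. The sketch below is purely illustrative: the response fields (`item_id`, `image_url`), the expected item count, and the time budget are assumptions standing in for whatever the consuming team actually specifies.

```python
import time

REQUIRED_KEYS = {"item_id", "image_url"}  # hypothetical response fields


def check_recommendations(get_recommendations, user_id,
                          expected_count=5, max_seconds=0.5):
    """Call the recommender once and return a list of contract violations."""
    start = time.perf_counter()
    items = get_recommendations(user_id)
    elapsed = time.perf_counter() - start

    problems = []
    if elapsed > max_seconds:
        problems.append(f"too slow: {elapsed:.3f}s > {max_seconds}s")
    if not isinstance(items, list):
        problems.append(f"expected a list, got {type(items).__name__}")
        return problems
    if len(items) != expected_count:
        problems.append(f"expected {expected_count} items, got {len(items)}")
    for position, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {position} is not a dict")
        elif REQUIRED_KEYS - set(item):
            missing = sorted(REQUIRED_KEYS - set(item))
            problems.append(f"item {position} lacks {missing}")
    return problems
```

An empty list means the response met the format and timing requirements. Anything else is a concrete list of mismatches to discuss with the team that consumes the output.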

The best way to avoid such a scenario is to discuss the requirements with the relevant team before we begin the development process. If the team is not available, go through the code documentation (you will most probably find a lot of information there) and, if necessary, the code itself to understand the requirements.

6. Version Control

Git, a version control system, is one of the best things to have happened to source code management in recent times. It tracks the changes made to the code. There are other version control systems, but Git is more widely used than any of them.

The process, in simple terms, is “modify and commit”. I have oversimplified it. There are many more steps to the process, such as creating a branch for development, committing changes locally, pulling files from the remote, pushing files to a remote branch, and much more, which I will leave for you to explore on your own.

Every time we make a change to the code, instead of saving the file under a different name, we commit the change, meaning the new version is recorded on top of the old one with a key linked to it. We usually write a comment every time we commit a change to the code. Let’s say you don’t like the changes made in the last commit and want to revert to the previous version: this can be done easily using the commit reference key. Git is powerful and useful for code development and maintenance.

You might have already understood why this is important for production systems and why it is mandatory to learn Git. We must always have the flexibility to go back to an older version that is stable just in case the new version fails unexpectedly.

7. Readability

The code you write should be easily digestible for others as well, or at least for your teammates. Moreover, if proper naming conventions are not followed, it will be challenging even for you to understand your own code a few months after writing it.

(i) Appropriate variable and function names

The variable and function names should be self-explanatory. When someone reads your code, it should be easy for them to find out what each variable contains and what each function does, at least to some extent.

It is perfectly okay to have a long name that clearly states its functionality or role rather than short names such as x, y, z, etc., that are vague. Try not to exceed 30 characters for variable names and 50–60 characters for function names.

Previously, the standard code width was 80 characters, based on an IBM standard that is now outdated. Now, as per GitHub standards, it is around 120. Setting the limit for variable names at a quarter of the page width gives 30 characters, which is long enough yet doesn’t fill the page. Function names can be a little longer but again shouldn’t fill the entire page, so setting the limit at half of the page width gives 60.

For instance, the variable for the average age of Asian men in a sample data set can be written as mean_age_men_Asia rather than age or x. A similar argument applies to function names as well.

(ii) Doc string and comments

In addition to appropriate variable and function names, it is essential to have comments and notes wherever necessary to help the reader in understanding the code.

Doc string: specific to a function, class, or module. It is the first few lines of text inside the function definition, describing the role of the function along with its inputs and outputs. The text should be placed between a pair of triple double quotes.

    def <function_name>(<arguments>):
        """<docstring>"""
        return <output>

Comments: can be placed anywhere in the code to inform the reader about the action or role of a particular section or line. The need for comments will be considerably reduced if we give appropriate names to variables and functions, since the code will be, for the most part, self-explanatory.
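Here is a small sketch that puts descriptive names, a docstring, and a comment together. The function `replace_erroneous_values` and the sentinel value `-999` are assumptions invented for this example. The docstring follows the NumPy style, which is one of several common formats.

```python
def replace_erroneous_values(ages, invalid_marker=-999):
    """Replace erroneous entries with the mean of the valid entries.

    Parameters
    ----------
    ages : list of float
        Recorded ages; entries equal to ``invalid_marker`` are treated as errors.
    invalid_marker : float, optional
        Sentinel that marks an erroneous entry (hypothetical convention).

    Returns
    -------
    list of float
        Copy of ``ages`` with every erroneous entry replaced.

    Raises
    ------
    ValueError
        If there is no valid entry to compute the replacement from.
    """
    valid_ages = [age for age in ages if age != invalid_marker]
    if not valid_ages:
        raise ValueError("no valid entries to compute a replacement from")
    # Use the mean so that the sample average is unchanged by the replacement.
    mean_valid_age = sum(valid_ages) / len(valid_ages)
    return [mean_valid_age if age == invalid_marker else age for age in ages]
```

Note that the single comment explains why the mean is used, which the names alone cannot convey. What the code does is already clear from names such as `valid_ages` and `mean_valid_age`. The docstring is also what `help(replace_erroneous_values)` displays.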

Code review:
Although it is not a direct step in writing production-quality code, code review by your peers will help improve your coding skills.

No one writes flawless code, except perhaps someone with more than 10 years of experience. There will always be room for improvement. I have seen professionals with several years of experience write awful code, and interns still pursuing their bachelor’s degrees with outstanding coding skills; you can always find someone who is better than you. It all depends on how many hours someone invests in learning, practicing, and, most importantly, improving that particular skill.

I know that people better than you always exist, but it is not always possible to find them in your team, the only people with whom you can share your code. Perhaps you are the best in your team. In that case, it is still okay to ask others in the team to test your code and give feedback on it. Even though they are not as good as you, they might catch something that escaped your eyes.

Code review is especially important when you are in early stages of your career. It would greatly improve your coding skills. Please follow the steps below for successfully getting your code reviewed.

(i) Complete writing your code, including all the development, testing, and debugging, and make sure you haven’t left in any silly mistakes. Then kindly request your peers for a code review.

(ii) Forward them your code link. Don’t ask them to review several scripts at one time. Ask them one after the other. The comments they give for the first script are perhaps applicable to other scripts as well. Make sure you apply those changes on other scripts, if applicable, before sending out the second script for review.

(iii) Give them a week or two to read and test your code for each iteration. Also provide all necessary information to test your code like sample inputs, limitations, and so on.

(iv) Meet with each one of them and get their suggestions. Remember, you don’t have to include all their suggestions in your code; at your own discretion, select the ones that you think will improve it.

(v) Repeat until you and your team are satisfied. Try to fix or improve your code within the first few iterations (3–4 at most), otherwise it might create a bad impression of your coding ability.

When it comes to poor coding quality, some data scientists will say that their work does not touch a production system, and that their code therefore does not need to be of a high standard. However, I would argue that common outputs of a data scientist’s work can actually be considered production:

* The ad-hoc analysis that discusses a useful insight that was shown to a senior executive may be used for a key financial decision. You may want to re-run that analysis in the future, and you can’t tell him or her a month later that you can’t reproduce the analysis because your codebase is incomprehensible.
* The report that gets sent out every week to a whole business unit. Multiple teams will use that to base decisions on, so you would want the code that generates it to be well-tested.
* The modelling pipeline you wrote that dumps scores daily into a CRM database. If your model gets enough traction, the business will want to roll it out to other teams. Other people now suddenly need to be able to read, extend and execute your codebase.

Production code is any code that feeds some business (decision) process. Since data science by design is meant to affect business processes, most data scientists are in fact writing code that can be considered production. Data scientists should therefore always strive to write good quality code, regardless of the type of output they create. Whatever type of data scientist you are, the code you write is only useful if it is production code.

## Production code
It is hard to give a general definition of what production code is, but a key difference from non-production code is that production code gets read and executed by many other people, instead of just the person who wrote it. We should therefore aim for our code to be

* Reproducible, because many people are going to run it.
* Modular and well-documented, because many people are going to read it.

These are challenges the software engineering world has already encountered, and it helps to look at how this field tackles them. I’ll discuss some tools that can give you an immediate positive impact on the quality of your work (if you are a data scientist) or the quality of your team (if you are a data science manager).

Some of these tools may seem daunting to learn initially, but for a lot of these you can copy templates that you create for your first project, to your other projects. All it takes therefore is a one-time investment to learn some useful tools and paradigms, that will pay dividends throughout your career as a data scientist.

To help you get started with these tools, I have set up a bare-bones repository that contains basic template files for some of the tools that I will discuss.

### Reproducible code
When you set up the codebase for your shiny new data science project, you should immediately set up the following tools:

* Version control your codebase using git or a similar tool.
    * The first thing you should do is to set up a version controlled repository on a remote server, so that each team member can pull an up-to-date version of the code. A great, 5 minute introduction to git can be found here.
    * Try to push code changes to the remote at a regular frequency (I would recommend daily, if possible).
    * Do not work on a single branch, whether you work alone or in a team. Choose a git branching workflow you like (it doesn’t really matter which one, just use one!) and stick with it.
* Create a reproducible python environment with virtualenv or conda.
    * These tools take a configuration file (a requirements.txt, installed with pip, in the case of virtualenv, or an environment.yml in the case of conda) that contains a list of the packages (with version numbers!).
    * Put this file in version control and distribute it across your team to ensure everybody is working in the same environment.
    * Consider coming up with a standard base environment so that you can reuse that whenever you or a team member start a new project.
    * Example: see the requirements.txt in the git repo here: https://github.com/thuijskens/production-tools/blob/master/requirements.txt
* Drop Jupyter notebooks as your main development tool.
    * Jupyter notebooks are great for quick exploration of the data you are working with, but do not use them as your main development tool.
    * Notebooks do not encourage a reproducible workflow, and you should see this talk for a good overview of why they don’t.
    * Use a proper IDE like PyCharm or VS Code (or vim if you’re into that) when developing code. Convince your employer to buy you professional editions of this software (this is usually peanuts for the company, and can be a massive productivity boost). I develop most of my code locally, but use PyCharm’s remote execution to execute any code on the cloud or an internal VM.

### Well-documented code
After you have set up your project in a way that will support reproducibility, take the following steps to ensure that it is possible for other people to read and understand it.

* Adopt a common project structure.
    * A common structure will make it easy for both members of your team, as well as other colleagues, to understand your codebase.
    * The specifics of the project structure again don’t matter much, just choose one and stick with it. The templates from Cookiecutter and Satalia are great starting points.
* Choose a coding style convention, and configure a linter to enforce it (potentially pre-commit).
    * Enforcing code conventions will make it easier for other people to read your codebase. I would recommend using something like PEP8, as many people in industry will already be familiar with it.
    * Enforcing coding conventions using a pre-commit linter can be good, as the programmer will not have to worry too much about the conventions during programming, because the linter will pick it up.
    * Using a linter will avoid pull requests (PRs) that are littered with coding style comments. These PRs are the worst to both review and receive a review for.
    * Example: black pre-commit plugin or yapf.
* Use Sphinx to automatically create the documentation of your codebase.
    * Pick a docstring format. I personally prefer NumPyDoc, but there are others. Again it does not matter which format you choose, just choose one and stick with it. Configure your IDE to use that docstring format, so that it will automatically create a template when you write a new function or class.
    * Use sphinx-quickstart to get a set of out-of-the-box configuration files, or copy the ones from my repository.
    * Using Sphinx can seem daunting at first, but it is one of those things that you set up once, and then copy the default configuration files from project to project.
    * The NumPyDoc format is described at https://numpydoc.readthedocs.io/en/latest/format.html

### Modular code
Finally, follow the steps below to ensure your codebase can be executed easily and robustly:

* Use a pipeline framework for your engineering and modelling workflows.
    * Frameworks like Apache Airflow and Luigi are a great way to make your code inherently modular.
    * They allow you to build your workflow as a series of nodes in a graph, and usually give you things like dependency management and workflow execution for free.
* Write unit tests for your codebase.
    * Pick a unit testing framework (like nose or pytest) and stick with it.
    * Writing unit tests can be cumbersome, but you want these tests in your codebase to ensure everything behaves as expected! This is especially important in data science, where we deal a lot with black-box algorithms.
* Consider adding continuous integration (CI) to your repository.
    * CI can be used to run your unit tests or pipeline after every commit or merge, making sure that no change to the codebase breaks it.
    * Many vendors offer integration with the code hosting platforms like GitHub or GitLab. All you need typically is a configuration file that is committed to your codebase, and you are ready to go!
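To give a feel for what "a series of nodes in a graph" means, here is a toy sketch built on the standard library's `graphlib` (Python 3.9+). It is not how Airflow or Luigi are implemented or used. The `run_pipeline` helper and the three task names are assumptions made up for this illustration. Each task is a function that receives the results of the tasks that ran before it, and the topological sort takes care of the execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run every task once, after all the tasks it depends on.

    tasks: dict mapping a task name to a callable that takes the results dict.
    dependencies: dict mapping a task name to the set of names it depends on.
    """
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical three-step workflow: load -> clean -> score.
tasks = {
    "load": lambda results: [3, 1, 2],
    "clean": lambda results: sorted(results["load"]),
    "score": lambda results: sum(results["clean"]) / len(results["clean"]),
}
dependencies = {
    "load": set(),
    "clean": {"load"},
    "score": {"clean"},
}

execution_order, pipeline_results = run_pipeline(tasks, dependencies)
```

Because each node only declares what it depends on, a step can be added, replaced, or unit-tested without touching the others. A CI job could run the tests and then this pipeline on every commit.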

Finally, ensure that the environment you develop your code in is reasonably similar to the production environment the code is going to run in. Especially in companies where development, staging and production environments for data science are not yet well-defined, I have seen teams developing code on architecture that is very different from the architecture the code actually has to run on in the end.

Data scientists, adopt these standards and see your employability increase, and complaints by your more software engineering-focused colleagues decrease. You’ll spend less time worrying about reproducibility, and rewriting software so that it can make it to production. The time saved here can be used to focus more on the fun part of our job: building models.

Data science managers, consider giving your team members a couple of days to get up to speed with these tools, and you will see that your codebases become more stable. It will be easier to onboard new members to your team and you will spend less time translating initial insights to production pipelines. Having a common way of working will also allow your team to start building utilities that tap into these conventions, increasing the overall productivity of your team.