# If you use a for loop you are doing it wrong
*How a declarative mindset will help you write much better data science code*

Whenever I see a for loop in a piece of data science Python code, my first response is "that is probably not needed". The for loop, however, is just one example of a deeper philosophical difference between the traditional imperative approach to software engineering and a data science approach that is more declarative in nature. In imperative programming the focus is on telling the computer *how* to perform a task; in declarative programming we simply state *what* we want and leave it to the computer to work out how the task is performed. This often leads to much shorter and faster code.

The goal of this article is to make you aware of this difference in coding style. Getting into a declarative mindset is especially important for people transitioning into data science from more mainstream programming in, for example, C# or C++.

# A well meaning iris
The goal of our first example is to calculate the mean of a column in a dataframe. Please feel free to try to solve the problem yourself first before looking at both approaches.

As a basis we use the iris dataset, and we want to calculate the mean of the `sepal_length` column:

In [18]:
import pandas as pd
import seaborn as sns
import numpy as np

iris = sns.load_dataset('iris')
iris.head()

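|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |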


We first start with the imperative approach: we iterate over the numbers in the `sepal_length` column, add them all up, keep track of the length, and finally calculate the mean by dividing the sum total by the length of the column:

In [7]:
sum_total = 0
length = 0
for number in iris['sepal_length']:
    sum_total += number
    length += 1
sum_total / length

5.843333333333335

Alternatively, this is the declarative approach:

In [8]:
iris['sepal_length'].mean()

5.843333333333335

So, in the imperative solution we spend a lot of code telling the computer what to do. In the declarative approach we simply state that we want the mean of that particular column. This nicely illustrates that the declarative code is a lot shorter and operates on a higher abstraction level than the imperative approach.
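The same idea scales beyond a single column. As a minimal sketch, the frame below is rebuilt by hand from the five rows shown by `iris.head()` (the name `sample` is only illustrative), and a single statement asks for the mean of every numeric column at once:

```python
import pandas as pd

# Illustrative frame: the five rows shown by iris.head() earlier.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# One declarative statement: the mean of every numeric column.
# numeric_only=True skips the text column 'species'.
column_means = sample.mean(numeric_only=True)
print(column_means)
```

The imperative version would need an extra outer loop over the columns; here the statement of *what* we want barely changes.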

# Min-max per species
The next example is a bit more complicated. We want to calculate the minimum and maximum of each of the measured variables per unique type of iris. So, a minimum for `sepal_length` for each of the three types of iris, the same for `sepal_width`, etc. Before looking at my answers, feel free to try this yourself. 

We first start with the purely imperative approach. Note that I intentionally omitted any of the smarter Python and Pandas syntax just to hammer home how much code you need to do this imperatively:

In [38]:
# Determine unique species of iris
iris_species = []
for entry in iris['species']:
    if entry not in iris_species:
        iris_species.append(entry)

# Prepare nested dictionaries to store min and max values per
# unique iris type and column in the dataset. This makes life 
# a lot easier when we actually loop over the data. 
# Note: min and max each need their own inner dictionary,
# otherwise both would silently share (and overwrite) the same values.
value_columns = iris.columns[:4]
min_values = {}
max_values = {}
for col in value_columns:
    min_per_type = {}
    max_per_type = {}
    for species in iris_species:
        min_per_type[species] = np.nan
        max_per_type[species] = np.nan
    min_values[col] = min_per_type
    max_values[col] = max_per_type

# Go through the data and actually determine the min and max
for column in value_columns:
    for idx, number in enumerate(iris[column]):
        current_species = iris['species'][idx]
        current_min = min_values[column][current_species]
        if np.isnan(current_min) or (number < current_min):
            current_min = number
        min_values[column][current_species] = current_min

        current_max = max_values[column][current_species]
        if np.isnan(current_max) or (number > current_max):
            current_max = number
        max_values[column][current_species] = current_max

min_values

{'sepal_length': {'setosa': 4.3, 'versicolor': 4.9, 'virginica': 4.9},
 'sepal_width': {'setosa': 2.3, 'versicolor': 2.0, 'virginica': 2.2},
 'petal_length': {'setosa': 1.0, 'versicolor': 3.0, 'virginica': 4.5},
 'petal_width': {'setosa': 0.1, 'versicolor': 1.0, 'virginica': 1.4}}

In [39]:
max_values

{'sepal_length': {'setosa': 5.8, 'versicolor': 7.0, 'virginica': 7.9},
 'sepal_width': {'setosa': 4.4, 'versicolor': 3.4, 'virginica': 3.8},
 'petal_length': {'setosa': 1.9, 'versicolor': 5.1, 'virginica': 6.9},
 'petal_width': {'setosa': 0.6, 'versicolor': 1.8, 'virginica': 2.5}}

The beauty of the declarative approach is that it almost directly follows from the problem statement: 

In [9]:
# we want to calculate the minimum and maximum of each of the measured variables per unique type of iris.
(
    iris
      .groupby(['species'])   # each of the measured variables per unique type of iris
      .agg(['min', 'max'])    # calculate the minimum and maximum
)

| species | sepal_length min | sepal_length max | sepal_width min | sepal_width max | petal_length min | petal_length max | petal_width min | petal_width max |
|---|---|---|---|---|---|---|---|---|
| setosa | 4.3 | 5.8 | 2.3 | 4.4 | 1.0 | 1.9 | 0.1 | 0.6 |
| versicolor | 4.9 | 7.0 | 2.0 | 3.4 | 3.0 | 5.1 | 1.0 | 1.8 |
| virginica | 4.9 | 7.9 | 2.2 | 3.8 | 4.5 | 6.9 | 1.4 | 2.5 |


This code is:

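- Much shorter: 3 lines versus more than 30 lines. 
- More versatile than the imperative code. For example, adding `median` alongside `min` and `max` only requires one extra entry in the list passed to `agg`. 
- A lot faster for bigger datasets, because pandas performs the grouping and aggregation in optimized compiled code instead of looping in the Python interpreter. 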
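The last two points can be checked with a rough sketch. The dataset below is synthetic, and names such as `big`, `loop_min_max` and `groupby_stats` are hypothetical helpers introduced only for this illustration. It adds `median` to the aggregation and times a hand-written loop against `groupby`; the exact timings depend on your machine and library versions, so treat them as an indication rather than a benchmark:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a larger dataset (sizes and values are made up).
rng = np.random.default_rng(seed=0)
n_rows = 100_000
big = pd.DataFrame({
    "species": rng.choice(["setosa", "versicolor", "virginica"], size=n_rows),
    "sepal_length": rng.normal(loc=5.8, scale=0.8, size=n_rows),
})


def loop_min_max(df):
    """Imperative version: track min and max per species by hand."""
    result = {}
    for species, value in zip(df["species"], df["sepal_length"]):
        if species not in result:
            result[species] = {"min": value, "max": value}
        else:
            result[species]["min"] = min(result[species]["min"], value)
            result[species]["max"] = max(result[species]["max"], value)
    return result


def groupby_stats(df):
    """Declarative version: adding 'median' is one extra list entry."""
    return df.groupby("species")["sepal_length"].agg(["min", "median", "max"])


loop_result = loop_min_max(big)
stats = groupby_stats(big)

# Time a few repetitions of each approach.
loop_seconds = timeit.timeit(lambda: loop_min_max(big), number=3)
groupby_seconds = timeit.timeit(lambda: groupby_stats(big), number=3)
print(f"loop: {loop_seconds:.3f}s, groupby: {groupby_seconds:.3f}s")
```

Note that adding the median to the loop version would require storing all values per species, whereas the declarative version only gained one word.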

# Finally declarative
Really getting into the declarative mindset will take some time. This can be hard, especially for people who are already experienced in other, more imperative languages. A good exercise is to force yourself to solve problems using the built-in pandas solutions. If you find yourself reaching for explicit loops, go back to the drawing board. For loops are not always the wrong answer, but in the beginning I would err on the side of caution. 
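As a small practice sketch of going back to the drawing board, here are two tasks that tempt you into a loop, rewritten as statements about whole columns. The frame `flowers` and the derived column `sepal_area` are made up for illustration and are not part of the examples above:

```python
import pandas as pd

# Illustrative frame with a few rows in the style of the iris data.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3],
    "sepal_width": [3.5, 3.0, 3.2, 3.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Loop-style thinking: "for each row, multiply length by width".
# Declarative rewrite: state the relationship between the columns.
flowers["sepal_area"] = flowers["sepal_length"] * flowers["sepal_width"]

# Loop-style thinking: "for each row, keep it if the sepal is long".
# Declarative rewrite: describe the rows we want with a boolean mask.
long_sepals = flowers[flowers["sepal_length"] > 5.0]
print(long_sepals)
```

In both cases the rewrite starts from the question "what do I want?" rather than "which steps do I take per row?".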