# introduction to the vectorizing principle

## part 2: vectorized translation of conditional statements

The first hurdle to vectorization is that control flow constructs like if-statements have no direct vectorized equivalent. One might consider a functional approach in which the true code block is applied to some rows and the false code block to the others. However, that would require a switch from library code to user code in a slow interpreted language on every loop iteration. That is also too slow for millions of rows.
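
To make the functional approach concrete, here is a minimal sketch that uses `DataFrame.apply` with `axis=1`. The function name `branch` and the small dataframe are only illustrative assumptions. The point is that pandas calls back into Python user code once per row, which is what makes this approach slow on large data.

```python
import pandas as pd

# Illustrative dataframe (same shape as the one used in the rest of this part)
df = pd.DataFrame(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))

def branch(row):
    # User code: executed by the Python interpreter once per row
    if row["str_col"] == "a":
        return row["int_col"] * 2  # true block
    return row["int_col"] * 3      # false block

# axis=1 passes each row to `branch` as a Series
out_apply = df.apply(branch, axis=1)
print(out_apply.tolist())
```

This produces the intended values, but it is not vectorized: the per-row function call remains.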

#### Initialization of a trivial dataframe with three rows:


In [2]:
import pandas as pd
df = pd.DataFrame(dict(int_col=[1,2,3], str_col=["a", "b", "c"]))
df

   int_col str_col
0        1       a
1        2       b
2        3       c


#### Here is the loop version that needs translation to vectorized code:

In [3]:
out = pd.Series(0, index=range(len(df)))
for i, row in df.iterrows():
    if row["str_col"] == "a":
        out[i] = row["int_col"] * 2
    else:
        out[i] = row["int_col"] * 3
out

0    2
1    6
2    9
dtype: int64

#### There are two general solutions to the problem of vectorizing conditional statements. 1) Provide both the true-block and the false-block values for all rows to a vectorized library call, together with a boolean series that decides which of the two to use:

In [4]:
import numpy as np
out2 = np.where(
    df["str_col"] == "a",  # boolean series for condition
    df["int_col"] * 2,  # true-block expression
    df["int_col"] * 3  # false-block expression
)
out2

array([2, 6, 9])
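
Note that the result above is a plain NumPy array, not a pandas Series, so the index of `df` is lost. If the index matters later on, one option (a sketch, not the only way) is to wrap the result in a Series that reuses `df.index`. The non-default index `[10, 20, 30]` below is an assumption chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with a non-default index
df = pd.DataFrame(
    dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]),
    index=[10, 20, 30],
)

values = np.where(
    df["str_col"] == "a",  # boolean series for condition
    df["int_col"] * 2,     # true-block expression
    df["int_col"] * 3,     # false-block expression
)

# Re-attach the original index so the result aligns with df again
out2_series = pd.Series(values, index=df.index)
print(out2_series)
```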

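#### 2) The filtering solution, which first computes the false-block values for all rows and then overwrites the rows selected by a boolean filter. The filter is applied on the left-hand side of the assignment as well: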

In [5]:
out3 = df["int_col"] * 3
_filter = df["str_col"] == "a"
out3[_filter] = df["int_col"][_filter] * 2
out3

0    2
1    6
2    9
Name: int_col, dtype: int64
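
As a side note, pandas also offers `Series.mask`, which replaces the values where a condition is true and keeps the index. The following sketch gives the same result as the filtering code above for this example. Like np.where, it needs the replacement values for all rows, so it is closer to solution 1) than to real filtering.

```python
import pandas as pd

df = pd.DataFrame(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))

# Start from the false-block values, then replace them where the condition holds
out_mask = (df["int_col"] * 3).mask(df["str_col"] == "a", df["int_col"] * 2)
print(out_mask.tolist())
```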

#### The np.where solution typically yields nicer code, since in the filtering solution it is cumbersome to ensure that the same filter is applied to every series used in the right-hand side expression. There are further silent bugs and traps caused by the index handling of pandas, which are not discussed here. However, np.where evaluates both branch expressions for all rows, so the filtering solution is needed when computing a conditional column where these unnecessarily computed values cause an exception:

In [6]:
try:
    df["power"] = (df["int_col"]-2)
    df["powered"] = np.where(df["power"] >= 0, df["int_col"] ** df["power"], df["int_col"])
except ValueError as e:
    print("ValueError:", e)

ValueError: Integers to negative integer powers are not allowed.


#### The filtering solution works here:

In [7]:
df["power"] = (df["int_col"]-2)
df["powered"] = df["int_col"]
_filter = df["power"] >= 0
df.loc[_filter, "powered"] = df["int_col"][_filter] ** df["power"][_filter]
df

   int_col str_col  power  powered
0        1       a     -1        1
1        2       b      0        1
2        3       c      1        3
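
One way to reduce the risk of applying different filters to different series on the right-hand side is to filter the dataframe once and take all columns from the filtered frame. This is a sketch under the assumption that all inputs live in the same dataframe. The name `sub` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))
df["power"] = df["int_col"] - 2
df["powered"] = df["int_col"]  # false-block values for all rows

_filter = df["power"] >= 0
sub = df.loc[_filter]  # filter once; every column below comes from `sub`

# The right-hand side is computed only for the filtered rows,
# so no negative exponent is ever evaluated
df.loc[_filter, "powered"] = sub["int_col"] ** sub["power"]
print(df["powered"].tolist())
```

The assignment aligns on the index, so the values computed from `sub` end up in the matching rows of `df`.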


Next: [vectorization03.ipynb](vectorization03.ipynb): a slightly more complex example