# introduction to the vectorizing principle

Disclaimer: The term vectorization is also used for the SIMD-based instruction-level parallelism provided by CPUs. Here, we talk about vectorization as a library design pattern for structural data transformation code: applying operations to whole vectors instead of individual scalars.

## part 1: moving the loop into the library

Loops over the rows of a dataframe, which may easily contain millions of rows, are perfectly fine in some languages (e.g. C++, Java, Julia) because ahead-of-time or just-in-time (JIT) compilation keeps the per-iteration overhead minimal. However, they are practically forbidden in interpreted languages such as Matlab, R, and Python, where every iteration pays interpreter overhead.

#### Why use Matlab, R, or Python if loops are often 1000 times slower than in C++?

These languages have a huge ecosystem of open-source and commercial libraries, which makes it possible to perform most tasks in just a few lines of code.

#### Initialization of a trivial dataframe with three rows:

In [1]:
import pandas as pd
df = pd.DataFrame(dict(int_col=[1,2,3], str_col=["a", "b", "c"]))
df

| | int_col | str_col |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |


#### Technically, dataframes also support iteration over rows in Python. When extracting column values of a row in this example, you get primitive Python types like `int` or `str`:

In [2]:
for i, row in df.iterrows():
    print(i, type(row["int_col"]), row["int_col"], type(row["str_col"]), row["str_col"])

0 <class 'int'> 1 <class 'str'> a
1 <class 'int'> 2 <class 'str'> b
2 <class 'int'> 3 <class 'str'> c
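
The value types seen inside the loop depend on the mix of columns, because `iterrows()` packs each row into a single Series. A minimal sketch, assuming a hypothetical frame with an added `float_col` column (not part of the example above), shows the integer column arriving as a float inside the loop:

```python
import pandas as pd

# Hypothetical frame for illustration: one integer and one float column.
mixed = pd.DataFrame(dict(int_col=[1, 2, 3], float_col=[0.5, 1.5, 2.5]))

# Take only the first (index, row) pair produced by iterrows().
first_index, first_row = next(mixed.iterrows())

# The column itself keeps its integer dtype ...
print(mixed["int_col"].dtype)

# ... but the row is one Series with one common dtype for all its values,
# so the integer is upcast to a float when it is read from the row.
print(first_row.dtype, type(first_row["int_col"]))
```

This is one more reason to treat row iteration as a last resort: the row object is not a faithful view of the column types.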


#### This way of iterating over rows and performing an action with column values would be just fine in C++, Java, or Julia. Since we don't have a million rows in this example, it also works in Python:

In [3]:
out = pd.Series(0, index=range(len(df)))
for i, row in df.iterrows():
    out[i] = row["int_col"] * 2
out

0    2
1    4
2    6
dtype: int64

#### The idea of the vectorized version of the example above is to move the loop into the library, which is written in a fast compiled language like C, C++, Rust or even Fortran. The type of a column of a pandas DataFrame is a pandas Series (similar to std::vector in C++). All elements of a Series share one type, so types are homogeneous per column. Column types are called *dtype* in pandas. The column `int_col` has the dtype int64.

In [4]:
print(type(df["int_col"]))
df.dtypes

<class 'pandas.core.series.Series'>


int_col     int64
str_col    object
dtype: object
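
The homogeneity of a column can be checked directly. The following is a small sketch, not part of the original notebook; the names `int_values`, `mixed_numbers` and `mixed_objects` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))

# The integer column can be exposed as a NumPy array with the same dtype.
int_values = df["int_col"].to_numpy()
print(type(int_values), int_values.dtype)

# Mixing integers and floats yields one common dtype for the whole Series.
mixed_numbers = pd.Series([1, 2.5])
print(mixed_numbers.dtype)

# Values without a common numeric type fall back to the generic object dtype.
mixed_objects = pd.Series([1, "a"])
print(mixed_objects.dtype)
```

A single dtype per column is what allows the library to run one tight compiled loop over the data instead of inspecting the type of every element.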

#### The vectorized version of the example above applies an element-wise multiplication operator to the whole series:

In [5]:
out2 = df["int_col"] * 2
out2

0    2
1    4
2    6
Name: int_col, dtype: int64
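
To get a feeling for the difference in speed, both variants can be timed on a larger frame. This is a rough sketch rather than a benchmark: the frame `big`, its size of 10,000 rows, and the helper names are assumptions made for illustration, and the measured times depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame; the size is an arbitrary choice for illustration.
big = pd.DataFrame({"int_col": np.arange(10_000)})


def doubled_with_loop(frame):
    # Row-by-row version: the loop runs in the Python interpreter.
    result = pd.Series(0, index=frame.index)
    for i, row in frame.iterrows():
        result[i] = row["int_col"] * 2
    return result


def doubled_vectorized(frame):
    # Vectorized version: the loop runs inside the library.
    return frame["int_col"] * 2


# Time a single call of each variant.
loop_seconds = timeit.timeit(lambda: doubled_with_loop(big), number=1)
vec_seconds = timeit.timeit(lambda: doubled_vectorized(big), number=1)
print(f"loop: {loop_seconds:.4f} s, vectorized: {vec_seconds:.6f} s")
```

Both functions return the same values; only the place where the loop is executed differs.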

#### It is also possible to multiply two series of equal length element-wise and to store the result back in the original dataframe as a new column:

In [6]:
df["out3"] = df["int_col"] * out
df

| | int_col | str_col | out3 |
|---|---|---|---|
| 0 | 1 | a | 2 |
| 1 | 2 | b | 8 |
| 2 | 3 | c | 18 |
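
One detail worth knowing when combining two series: pandas matches elements by index label, not by position. In the example above both series share the index 0, 1, 2, so the distinction does not matter. A minimal sketch with two hypothetical series, `s1` and `s2`, whose labels are in a different order:

```python
import pandas as pd

# Two hypothetical series with the same labels in a different order.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Label-based: "a" is multiplied with "a", "b" with "b", "c" with "c".
product = s1 * s2
print(product)

# Position-based: drop the index by going through NumPy arrays.
positional = s1.to_numpy() * s2.to_numpy()
print(positional)
```

Here the label-based product pairs 1 with 30 and 3 with 10, while the positional product pairs 1 with 10 and 3 with 30.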


Next: [vectorization02.ipynb](vectorization02.ipynb): vectorized translation of conditional statements