# <font color=#ac9f00><center>DataFrame iterations</center></font>

Pandas has many powerful tools for aggregating data and doing basic arithmetic and statistics. But there are still many times when we need to iterate through a DataFrame and do some specific task with the data. As always, there are many ways to do that, from standard for loops to vectorization, so what is the difference?

### <font color=#ac9f00>You want that by the morning or next month?</font>

The title is meant as a joke, but it could just as well be true. Working with a DataFrame of a few thousand rows won't make much of a difference, but the moment we start working with Big Data, the chosen method can save us some serious time! I must admit I did not pay much attention to this topic until now, but __learn the best practices as you start, so they become your habit__. There are many great sources on this topic, so I will just simplify them and try them out in this notebook, but definitely read through some Stack Overflow answers and the documentation on this; it is at the core of many more operations you will use.

### <font color=#ac9f00>Looping vs Vectorization</font>

<font color=#ac9f00>Looping</font> is familiar from most programming languages, so it is what people reach for first here as well (as I did...). It covers the standard vanilla for loop and the Pandas iteration methods __.iterrows()__ and __.itertuples()__. These methods walk through the rows one at a time and build a separate object for each row (a Series for .iterrows(), a namedtuple for .itertuples()). That is a lot of extra work when all we need is a column-wide calculation.
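To make "each row as an object" concrete, here is a small sketch of what the two iterators hand back. The frame `demo` and its values are made up for illustration and are not part of the benchmark.

```python
import pandas as pd

# Tiny hypothetical frame with one integer and one float column
demo = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# .iterrows() yields (index, Series) pairs; a new Series is built per row,
# and mixed column dtypes are upcast to one common dtype for that row
idx, row = next(demo.iterrows())
print(type(row).__name__, row.dtype)

# .itertuples() yields one namedtuple per row, with the index stored in .Index
tup = next(demo.itertuples())
print(type(tup).__name__, tup.Index, tup.a, tup.b)
```

Building a namedtuple is cheaper than building a Series, which is one reason .itertuples() tends to be the faster of the two.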

<font color=#ac9f00>Vectorization</font>, on the other hand, is optimized for speed. The difference lies in the way the data is accessed: we can think of a DataFrame as a number of Series standing next to each other (the columns) and use vectorized functions on them; between NumPy and Pandas you will find most of the functions you need. These functions are applied to a whole Series, the array of data we want to manipulate, at once. So there is no need to touch each individual value in a separate iteration step; we work on one array in a single, simple and efficient operation.
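As a rough illustration of the idea, the sketch below computes the same thing twice, once row by row and once on whole columns. The frame `demo`, the seed and the formula `a * 2 + c` are assumptions chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows of random integers in three columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 100, size=(1000, 3)), columns=["a", "b", "c"])

# Row by row: one Python-level step for every row
looped = [row.a * 2 + row.c for row in demo.itertuples()]

# Vectorized: one expression over whole columns, no explicit loop
vectorized = demo["a"] * 2 + demo["c"]

# Both approaches give the same values
same = vectorized.tolist() == looped
print(same)
```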

Enough text, let's try it out. We will use the IPython magic command __%timeit__ to measure the time and, for simplicity, create a DataFrame filled with random integers. In the end we will only do an assignment with a basic calculation on one column; the extra columns are there just to show again how to access a specific one.
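__%timeit__ only exists inside IPython and Jupyter. If you want to repeat a measurement in a plain Python script, the standard library module `timeit` can be used in a similar way. This is only a sketch: the names `demo` and `assign_vectorized`, and the choice of 100 calls repeated 7 times, are assumptions for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame shaped like the one used in this notebook
demo = pd.DataFrame(np.random.randint(100, size=(10000, 3)), columns=["a", "b", "c"])
y = 15

def assign_vectorized():
    # The statement we want to time
    demo.loc[:, "b"] = y * 2

# 7 repeats of 100 calls each; every entry is the total time of 100 calls
runs = timeit.repeat(assign_vectorized, number=100, repeat=7)
per_call_us = min(runs) / 100 * 1e6
print(f"best of 7: {per_call_us:.1f} µs per call")
```

Note that this prints the best run, while %timeit reports the mean and standard deviation, so the numbers are not directly comparable.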

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(np.random.randint(100, size=(10000,3)))
df.columns = ["a", "b", "c"]
y = 15
df.head(3)

|   | a | b | c |
|---|---|---|---|
| 0 | 23 | 29 | 76 |
| 1 | 2 | 43 | 23 |
| 2 | 76 | 89 | 32 |


#### <font color=#ac9f00>__.iterrows()__</font>

In [3]:
%%timeit 
# %%timeit (double %) times the whole notebook cell, %timeit (single %) times only one line of code

for i, row in df.iterrows():
    df.loc[i,"b"] = y*2

1.58 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### <font color=#ac9f00>__.itertuples()__</font>

In [4]:
%%timeit 

for row in df.itertuples():
    df.loc[row.Index,"b"] = y*2

968 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### <font color=#ac9f00>__.apply()__</font>
Here we use the Series method .apply() with a lambda expression, followed by the very similar .map() for comparison.

In [5]:
%timeit df["b"] = df["b"].apply(lambda x: y*2)

2.65 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%timeit df["b"] = df["b"].map(lambda x: y*2)

2.69 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
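Even though .apply() is much faster than the explicit loops above, it still calls the Python function once for every element. A small sketch to show that, with a hypothetical Series `demo` and a helper `double` that counts its own calls:

```python
import pandas as pd

# Hypothetical five-element Series
demo = pd.Series([1, 2, 3, 4, 5])
calls = {"n": 0}

def double(x):
    # Count how many times pandas calls this function
    calls["n"] += 1
    return x * 2

applied = demo.apply(double)   # the function runs once per element
vectorized = demo * 2          # one operation on the whole array

print(calls["n"], applied.equals(vectorized))
```

The result is the same either way; the vectorized version simply skips the per-element function calls.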


#### <font color=#ac9f00>__list comprehension__</font>

In [7]:
%timeit df["b"] = [y*2 for x in df["b"]]

2.72 ms ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### <font color=#ac9f00>__Vectorization__</font>

In [8]:
%timeit df.loc[:,"b"] = y*2

160 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
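The benchmark assigns the same constant to every row. When the new value depends on other columns, the work can usually still be written on whole columns, for example with `np.where`. The sketch below is an assumption-based example: the frame `demo`, the seed and the rule "double `a` where it is above 50, otherwise take `c`" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the benchmark frame
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.integers(0, 100, size=(1000, 3)), columns=["a", "b", "c"])

# Row-by-row reference result, kept only for comparison
expected = [row.a * 2 if row.a > 50 else row.c for row in demo.itertuples()]

# Vectorized: the condition and both branches are evaluated on whole columns
demo["b"] = np.where(demo["a"] > 50, demo["a"] * 2, demo["c"])

matches = demo["b"].tolist() == expected
print(matches)
```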


## <font color=#ac9f00>__Verdict? By the morning, please__</font>

__.iterrows()__ &emsp; 1 580 000 µs

__.itertuples()__ &emsp; 968 000 µs

__.apply()__ &emsp; 2 650 µs

__.map()__ &emsp; 2 690 µs

__list comprehension__ &emsp; 2 720 µs

__vectorization__ &emsp; 160 µs

As you can see, the vectorized method is almost 10 000x faster than .iterrows(), the slowest iteration method. And this was a DataFrame with only 10k rows and basic arithmetic; now imagine doing more complex tasks on a dataset with gigabytes of data! So regardless of the size of the data you are working with, always look for the most efficient method available. A little extra time spent working it out now can save you an eternity in the future... Yeah, I was never good at trying to sound smart. Well, if you have read this far, thank you for your time; I hope it will be returned with some interest on top. Have a great day!