- Let's look at the three main ways to iterate over a DataFrame:

items()

iterrows()

itertuples()

In [None]:
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32]},
    index=['id001', 'id002', 'id003', 'id004', 'id005', 'id006'])

In [None]:
df

In [None]:
df.items()

This returns a generator:

We can use this to generate pairs of col_name and data. Each pair contains a column name and a Series holding every row's value for that column. Let's loop through the column names and their data:

In [None]:
for col_name, data in df.items():
    print("col_name:",col_name, "\ndata:",data)

In [None]:
for col_name,data in df.items():
    # .iloc selects by position; plain data[1] on a string-labelled index is deprecated
    print('col_name:',col_name,"\ndata:", data.iloc[1])

In [None]:
for col_name,data in df.items():
    print('col_name:',col_name,"\ndata:",data['id002'])
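To make the shape of each pair explicit, here is a minimal sketch that records what items() hands back for every column. The frame `people` and the dictionary `summary` are illustrative names, not part of the notebook above.

```python
import pandas as pd

# Small stand-in frame (assumed for illustration; mirrors two of the columns above)
people = pd.DataFrame(
    {'first_name': ['John', 'Jane'], 'age': [34, 29]},
    index=['id001', 'id002'])

# Each item is a (column label, Series) pair; the Series keeps the row index
summary = {}
for col_name, data in people.items():
    summary[col_name] = (type(data).__name__, list(data.index))

print(summary)
```

Because every value is a Series indexed by the row labels, both positional access (`data.iloc[1]`) and label access (`data['id002']`) work inside the loop.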

Iterating DataFrames with iterrows():

While df.items() works column by column, yielding one column at a time, iterrows() yields each row as a pair: the row's index label and a Series holding that row's data.

Let's try iterating over the rows with iterrows():

In [None]:
%%prun
for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row}\n")

Iterating DataFrames with itertuples():

The itertuples() function will also return a generator, which generates row values in tuples. Let's try this out:

In [None]:
for row in df.itertuples():
    print(row)

The itertuples() method has two arguments: index and name.

We can leave out the index by setting the index parameter to False:

In [None]:
%%prun
for row in df.itertuples(index=False):
    print(row)

As you may have noticed, this generator yields namedtuples with the default name Pandas. We can change this by passing 'People' to the name parameter. Any valid name works, but it's best to pick one relevant to your data:

In [None]:
for row in df.itertuples(index=False, name='People'):
    print(row)
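Since the rows are namedtuples, their fields can be read by attribute instead of by position. A small sketch, assuming a cut-down frame called `people` and the tuple name 'Person' (both chosen here for illustration):

```python
import pandas as pd

people = pd.DataFrame(
    {'first_name': ['John', 'Jane'], 'age': [34, 29]},
    index=['id001', 'id002'])

labels = []
for row in people.itertuples(name='Person'):
    # With index=True (the default), the index label is exposed as row.Index;
    # each column becomes an attribute named after the column
    labels.append(f"{row.Index}: {row.first_name} ({row.age})")

print(labels)
```

Attribute access only works for column names that are valid Python identifiers; other columns are still reachable by position.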

Iterrows():
    
iterrows() is a built-in pandas method for iterating over a DataFrame row by row. Avoid it where you can, because it is very slow compared with other iteration techniques. iterrows() makes many function calls while iterating, and it builds a new Series for every row, which adds considerable overhead.

- iterrows() took 790 seconds to iterate through a DataFrame with 10 million records
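Speed is not the only cost. Because iterrows() returns each row as a single Series, the values in a row are cast to one common dtype. A minimal sketch of that effect, using a made-up frame (`mixed`, `qty` and `price` are illustrative names):

```python
import pandas as pd

# One integer column and one float column
mixed = pd.DataFrame({'qty': [1, 2], 'price': [1.5, 2.5]})

# First row via iterrows(): a Series, so both values share one dtype
_, first_row = next(mixed.iterrows())

# First row via itertuples(): each value keeps its own type
first_tuple = next(mixed.itertuples(index=False))

print(mixed['qty'].dtype)     # dtype of the column in the frame
print(first_row.dtype)        # dtype of the row Series
print(type(first_row['qty']), type(first_tuple[0]))
```

Here the integer `qty` comes back as a float inside the iterrows() row, while itertuples() leaves it as an integer.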

itertuples() is also a built-in pandas method for iterating over a DataFrame. It makes far fewer function calls than iterrows() and carries much less overhead. itertuples() iterates through the DataFrame by converting each row into a namedtuple.

itertuples() took 16 seconds to iterate through a DataFrame with 10 million records, which is around 50 times faster than iterrows().
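The comparison can be reproduced on a much smaller scale. The sketch below is only a rough timing harness: the synthetic frame, its 20,000-row size and the helper names are assumptions made for illustration, and absolute timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

# Synthetic frame; the row count is arbitrary and kept small so the run is short
frame = pd.DataFrame({'a': np.arange(20_000), 'b': np.arange(20_000) * 2})

def sum_with_iterrows(df):
    """Add up a + b for every row using iterrows()."""
    total = 0
    for _, row in df.iterrows():
        total += row['a'] + row['b']
    return total

def sum_with_itertuples(df):
    """Add up a + b for every row using itertuples()."""
    total = 0
    for row in df.itertuples(index=False):
        total += row.a + row.b
    return total

start = time.perf_counter()
total_rows = sum_with_iterrows(frame)
rows_seconds = time.perf_counter() - start

start = time.perf_counter()
total_tuples = sum_with_itertuples(frame)
tuples_seconds = time.perf_counter() - start

print(f"iterrows():   {rows_seconds:.3f} s")
print(f"itertuples(): {tuples_seconds:.3f} s")
```

Both loops compute the same total, so any difference in the printed times comes from the iteration method itself.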

## Numpy Array Iteration:

- Iteration defeats the purpose of using pandas; vectorization is almost always the better choice. When a loop is unavoidable, the df.values attribute (or the df.to_numpy() method) converts the DataFrame to a NumPy array, which can then be iterated row by row.

In [None]:
df

In [None]:
Numpy_array = df.to_numpy()  # same data as df.values; to_numpy() is the recommended accessor
Numpy_array

In [None]:
for i in Numpy_array:
    print(i)

In [None]:

for i in Numpy_array:
    print(i[0])

Iterating this way took 14 seconds for a DataFrame with 10 million records, which is around 56 times faster than iterrows().
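For comparison, many tasks need no explicit loop at all. The sketch below contrasts a loop over the NumPy array with the vectorized equivalent; the frame `people` and the two operations (adding one year, filtering by age) are examples chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame({'age': [34, 29, 37, 52, 26, 32]})

# Loop version: visit every value one at a time
loop_result = [age + 1 for age in people['age'].to_numpy()]

# Vectorized version: one expression applied to the whole column
vectorized_result = people['age'] + 1

# Vectorized filtering with a boolean mask instead of an if-statement in a loop
over_30 = people[people['age'] > 30]

print(vectorized_result.tolist())
print(over_30)
```

The vectorized forms push the work into compiled code inside pandas and NumPy, which is why they generally outperform any of the iteration methods above.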


Write a pandas program to iterate through the diamonds DataFrame.

In [None]:
import pandas as pd
diamonds = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv')

In [None]:
diamonds.head()

In [None]:
diamonds.shape

In [None]:
%%prun
for i in diamonds.itertuples(index=False,name='diamond'):
    print(i)
    
# 1674377 function calls (1674355 primitive calls) in 10.949 seconds


In [None]:
%%prun
for i,row in diamonds.iterrows():
    print("Index:",i,"\nrow:",row)
# 78329388 function calls (75254804 primitive calls) in 205.141 seconds
