In [1]:
import pandas as pd
import numpy as np

In [2]:
# Import the DataSet
# df = pd.read_csv('https://raw.githubusercontent.com/mlabonne/how-to-data-science/main/data/nslkdd_test.txt')

df = pd.read_csv('C:/MyLearn/DataSet/Pandas/nslkdd_test.txt')
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type,other
0,0,tcp,private,REJ,0,0,0,0,0,0,...,0.04,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune,21
1,0,tcp,private,REJ,0,0,0,0,0,0,...,0.00,0.06,0.00,0.00,0.00,0.0,1.00,1.00,neptune,21
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,0.61,0.04,0.61,0.02,0.00,0.0,0.00,0.00,normal,21
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,1.00,0.00,1.00,0.28,0.00,0.0,0.00,0.00,saint,15
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,0.31,0.17,0.03,0.02,0.00,0.0,0.83,0.71,mscan,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,0,tcp,smtp,SF,794,333,0,0,0,0,...,0.72,0.06,0.01,0.01,0.01,0.0,0.00,0.00,normal,21
22540,0,tcp,http,SF,317,938,0,0,0,0,...,1.00,0.00,0.01,0.01,0.01,0.0,0.00,0.00,normal,21
22541,0,tcp,http,SF,54540,8314,0,0,0,2,...,1.00,0.00,0.00,0.00,0.00,0.0,0.07,0.07,back,15
22542,0,udp,domain_u,SF,42,42,0,0,0,0,...,0.99,0.01,0.00,0.00,0.00,0.0,0.00,0.00,normal,21


This dataset has **22k rows and 43 columns** with a combination of categorical and numerical values. Each row describes a connection between two computers.

Let’s say we want to create a new feature: the total number of bytes in the connection. We just have to sum up two existing features: `src_bytes` and `dst_bytes`. Let's see different methods to calculate this new feature.

In [3]:
%%time
df['Total_bytes']=df['src_bytes']+df['dst_bytes']

CPU times: total: 0 ns
Wall time: 2 ms


In [4]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


### Method 1. Iterrows

According to the official documentation, `iterrows()` iterates "over the rows of a Pandas DataFrame as (index, Series) pairs". It **converts each row into a Series object**, which causes two problems:

1. It can **change the type of your data** (dtypes);
2. The conversion **greatly degrades performance**.

For these reasons, the ill-named `iterrows()` is the **WORST possible method** to actually iterate over rows.
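The dtype change is easy to reproduce on a tiny frame. The sketch below uses a made-up two-column DataFrame (`demo` and its `rate` column are illustrative, not part of the dataset): because each row is packed into a single Series, the integer column is upcast to float so that both values can share one dtype.

```python
import pandas as pd

# Hypothetical toy frame: one integer column, one float column.
demo = pd.DataFrame({"src_bytes": [12983, 20], "rate": [0.61, 1.0]})
column_dtype = demo["src_bytes"].dtype  # integer dtype in the DataFrame

# iterrows() yields (index, Series) pairs; take the first row.
_, first_row = next(demo.iterrows())
row_dtype = first_row.dtype  # a single dtype for the whole row

print(column_dtype)
print(row_dtype)
print(type(first_row["src_bytes"]))  # the integer came back as a float
```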

In [5]:
%%time
# Iterrows
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])
df['Total_bytes']=total

CPU times: total: 1.56 s
Wall time: 1.54 s


In [6]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


Now let’s see slightly better techniques…

### Method 2. For loop with .loc or .iloc (3× faster)

This is what I used to do when I started: a basic **for loop to select rows by index** (with `.loc` or `.iloc`). Why is it bad? Because DataFrames are **not designed for this purpose**. Every iteration goes through the **Pandas indexing machinery just to fetch a single value**, which degrades performance. Interestingly enough, `.iloc` is faster than `.loc`. It makes sense since Pandas **doesn't have to check user-defined labels** and can directly look at **where the row is stored in memory**.
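To make the label/position distinction concrete, here is a minimal sketch on a toy Series whose index is deliberately not `0, 1, 2` (the frame `demo` and its labels are made up for illustration). With the default `RangeIndex` used in this notebook, labels and positions happen to coincide, which is why both loops return the same values.

```python
import pandas as pd

# Hypothetical toy frame with custom row labels 10, 20, 30.
demo = pd.DataFrame({"src_bytes": [0, 12983, 20]}, index=[10, 20, 30])

by_label = demo["src_bytes"].loc[20]     # row whose label is 20
by_position = demo["src_bytes"].iloc[1]  # second row, whatever its label

# .loc looks labels up, so a label that does not exist raises KeyError.
try:
    demo["src_bytes"].loc[1]
    missing_label = False
except KeyError:
    missing_label = True

print(by_label, by_position, missing_label)
```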

In [7]:
%%time
# For loop with .loc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].loc[index] + df['dst_bytes'].loc[index])
df['Total_bytes']=total

CPU times: total: 1.27 s
Wall time: 1.27 s


In [8]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


In [9]:
%%time
# For loop with .iloc
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].iloc[index] + df['dst_bytes'].iloc[index])
df['Total_bytes']=total

CPU times: total: 844 ms
Wall time: 870 ms


In [10]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


Even this basic for loop with `.iloc` is 3 times faster than the first method!

### Method 3. Apply (4× faster)

The `apply()` method is **another popular choice** to iterate over rows. It creates code that is **easy to understand** but at a cost: performance is **nearly as bad as the previous for loop**.

This is why I would strongly advise you to **avoid this function for this specific purpose** (it's fine for other applications).

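Note that `apply()` with `axis=1` **returns a Series**, which can be assigned to the new column directly. Calling the `to_list()` method on it would give a plain list like the other methods, with **identical results**.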

In [11]:
%%time
# Apply
df['Total_bytes']=df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1)

CPU times: total: 516 ms
Wall time: 527 ms


In [12]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


The `apply()` method is a for loop in disguise, which is why the performance doesn't improve that much: it's only 4 times faster than the first technique.
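One way to see the loop in disguise is to record what the function receives. This is a small sketch on made-up data (`demo`, `add_bytes` and `seen_types` are illustrative names): the function is called row by row, and every call gets a freshly built Series, just like with `iterrows()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the two byte columns.
demo = pd.DataFrame({"src_bytes": [0, 12983, 20], "dst_bytes": [0, 0, 15]})

seen_types = []  # records what apply() hands to the function on each call

def add_bytes(row):
    seen_types.append(type(row).__name__)
    return row["src_bytes"] + row["dst_bytes"]

totals = demo.apply(add_bytes, axis=1)  # one Python-level call per row

print(seen_types)
print(totals.to_list())
```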

### Method 4. Itertuples (10× faster)

If you know about `iterrows()`, you probably know about `itertuples()`. According to the official documentation, it iterates "over the rows of a DataFrame as namedtuples of the values". In practice, it means that **rows are converted into tuples**, which are much lighter objects than Pandas Series.

This is why `itertuples()` is a better version of `iterrows()`. The only issue is that, in the code below, we **select columns based on their position** in the tuple, which is **not as user-friendly** as previous techniques.
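As a possible alternative to positional access, the namedtuples also expose columns as attributes, provided the column names are valid Python identifiers. A minimal sketch on a made-up frame (`demo` is illustrative; `index=False` simply drops the index from each tuple):

```python
import pandas as pd

# Hypothetical toy frame standing in for the two byte columns.
demo = pd.DataFrame({"src_bytes": [0, 12983, 20], "dst_bytes": [0, 0, 15]})

total = []
for row in demo.itertuples(index=False):
    # Fields are reachable by column name instead of by position.
    total.append(row.src_bytes + row.dst_bytes)

print(total)  # [0, 12983, 35]
```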

In [13]:
%%time
# Itertuples: position 0 of each tuple is the index, so 5 and 6 are src_bytes and dst_bytes
total = []
for row in df.itertuples():
    total.append(row[5] + row[6])
df['Total_bytes']=total

CPU times: total: 172 ms
Wall time: 167 ms


In [14]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


Okay, it’s 10 times faster than `iterrows()`: this is starting to look better.

### Method 5. List comprehensions (200× faster)

List comprehensions are a **fancy way to iterate over a list as a one-liner**. For instance, `[print(i) for i in range(10)]` prints numbers from 0 to 9 **without any explicit for loop**. I say "**explicit**" because Python actually **processes it as a for loop** if we look at the **bytecode**. So why is it **faster**? Quite simply because we don't call the `.append()` method in this version.
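The bytecode claim can be checked with the standard `dis` module. The sketch below is specific to CPython, and the helper `opnames` is a made-up name: it collects opcode names from a compiled comprehension, including any nested code object, since older CPython versions compile the comprehension body separately. The loop shows up as `FOR_ITER`, and elements are added with the dedicated `LIST_APPEND` opcode rather than a lookup and call of `.append()`.

```python
import dis

def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = [ins.opname for ins in dis.get_instructions(code)]
    for const in code.co_consts:
        if hasattr(const, "co_code"):  # nested code object (older CPython)
            names += opnames(const)
    return names

# Compile only; the names a and b are never evaluated.
code = compile("[src + dst for src, dst in zip(a, b)]", "<demo>", "eval")
ops = opnames(code)

print("FOR_ITER" in ops)     # the comprehension is still a loop
print("LIST_APPEND" in ops)  # but appends use a dedicated opcode
```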

In [15]:
%%time
# List comprehension
total = [src + dst for src, dst in zip(df['src_bytes'], df['dst_bytes'])]
df['Total_bytes']=total

CPU times: total: 31.2 ms
Wall time: 23 ms


In [16]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


Indeed, this technique is **200 times faster** than the first one! But we can still do better.

### Method 6. Pandas vectorization (1500× faster)

Until now, all the techniques used **simply add up single values**. Instead of adding single values, why not **group them into vectors to sum them up**? The difference between **adding two numbers or two vectors is not significant for a CPU**, which should speed things up.

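On top of that, the addition runs in **compiled code over whole arrays**, where the CPU can apply SIMD instructions to several values at once. Note that plain Pandas arithmetic generally runs on a **single core** unless an optional accelerator such as `numexpr` is installed and the data is large enough for Pandas to use it.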

The syntax is also the simplest imaginable: this solution is **extremely intuitive**. Under the hood, **Pandas takes care of vectorizing our data** with **optimized C code using contiguous memory blocks**.
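As a quick sanity check, the sketch below builds a synthetic frame (random values, not the NSL-KDD data; `demo`, `looped` and `vectorized` are illustrative names) and confirms that a single column-level expression gives the same values as the element-by-element loop.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two byte columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "src_bytes": rng.integers(0, 60_000, size=1_000),
    "dst_bytes": rng.integers(0, 10_000, size=1_000),
})

# Element by element, in Python.
looped = [src + dst for src, dst in zip(demo["src_bytes"], demo["dst_bytes"])]

# Whole columns at once: one expression, no Python-level loop.
vectorized = demo["src_bytes"] + demo["dst_bytes"]

same_values = vectorized.tolist() == looped
print(same_values)
```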

In [17]:
%%time
# Vectorization
df['Total_bytes']=df['src_bytes'] + df['dst_bytes']

CPU times: total: 0 ns
Wall time: 2 ms


In [18]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


This code is **1500 times faster than iterrows()** and it is even simpler to write.

### Method 7. NumPy vectorization (1900× faster)

NumPy is designed to **handle scientific computing**. It has **less overhead** than Pandas methods since rows and DataFrames all become **`np.ndarray`** objects. It relies on the **same optimizations** as Pandas vectorization.

There are **two ways of converting a Series into a** `np.array`: using `.values` or `.to_numpy()`. The Pandas documentation has **recommended `.to_numpy()` over `.values` for years**, which is why we're going to use `.to_numpy()` in this example.
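Part of the overhead NumPy skips is index handling. The sketch below uses made-up data (`demo`, `a` and `b` are illustrative): `.to_numpy()` returns plain arrays that are added position by position, whereas adding two Series first aligns them on their labels. With the default index used in this notebook both give the same result, but they can differ when the indexes do not match.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the two byte columns.
demo = pd.DataFrame({"src_bytes": [0, 12983, 20], "dst_bytes": [0, 0, 15]})
src = demo["src_bytes"].to_numpy()  # plain np.ndarray, no index attached
dst = demo["dst_bytes"].to_numpy()
demo["Total_bytes"] = src + dst     # assigned back by position

# Two Series with the same labels in a different order.
a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([10, 20], index=[1, 0])
aligned = (a + b).to_dict()                          # {0: 21, 1: 12}, matched by label
positional = (a.to_numpy() + b.to_numpy()).tolist()  # [11, 22], matched by position

print(type(src).__name__, aligned, positional)
```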

In [19]:
%%time
# Numpy vectorization

df['Total_bytes']= df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()

CPU times: total: 0 ns
Wall time: 999 µs


In [20]:
print(df[['src_bytes','dst_bytes','Total_bytes']])
del df['Total_bytes']

       src_bytes  dst_bytes  Total_bytes
0              0          0            0
1              0          0            0
2          12983          0        12983
3             20          0           20
4              0         15           15
...          ...        ...          ...
22539        794        333         1127
22540        317        938         1255
22541      54540       8314        62854
22542         42         42           84
22543          0          0            0

[22544 rows x 3 columns]


We found our winner with a technique that is **1900 times faster** than our first competitor! Let’s wrap things up.

### Conclusion

Don’t be like me: if you need to **iterate over rows in a DataFrame, vectorization** is the way to go! Vectorized operations are **not harder to read**, they **don’t take longer to write**, and the **performance gain is incredible**.

It’s not just about performance: **understanding how each method works** under the hood helped me to **write better code**. Performance gains are always based on the same techniques: **transforming data into vectors and matrices** to take advantage of **parallel processing**. Alas, this is often at the **expense of readability**. But it doesn’t have to be.
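To rerun the comparison on your own machine, a small harness along these lines may help. It is only a sketch under stated assumptions: the data is synthetic, the names `demo`, `methods`, `results` and `timings` are made up, and it uses `timeit.repeat` instead of a single `%%time` run to reduce noise. The absolute numbers and ratios will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the two byte columns.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "src_bytes": rng.integers(0, 60_000, size=2_000),
    "dst_bytes": rng.integers(0, 10_000, size=2_000),
})

methods = {
    "iterrows": lambda: [r["src_bytes"] + r["dst_bytes"] for _, r in demo.iterrows()],
    "itertuples": lambda: [r.src_bytes + r.dst_bytes for r in demo.itertuples(index=False)],
    "list comprehension": lambda: [s + d for s, d in zip(demo["src_bytes"], demo["dst_bytes"])],
    "pandas vectorization": lambda: demo["src_bytes"] + demo["dst_bytes"],
    "numpy vectorization": lambda: demo["src_bytes"].to_numpy() + demo["dst_bytes"].to_numpy(),
}

# Every method should produce the same values before we compare speed.
results = {name: list(func()) for name, func in methods.items()}

# Best of three single runs per method, in seconds.
timings = {name: min(timeit.repeat(func, number=1, repeat=3)) for name, func in methods.items()}

for name, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {seconds * 1e3:9.3f} ms")
```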

***