# Why do we need to optimize Pandas?

#### __*Pandas*__ is one of the greatest and most useful Python libraries you will ever find, but when you deal with BIG DATA (> 1 million rows), there are simple ways to optimise it and save yourself a lot of time.
#### In this notebook we loop over Pandas columns, compare two of them to compute a third, and __TIME IT__ to find the fastest approach! <br><font color=blue>Let's look at the following approaches and compare the running time of each: </font><br>
- Using a standard Python loop
- Using Pandas __*iterrows()*__
- Using the Pandas __*apply()*__ method
- Using the Pandas Vectorization Method

# Importing Necessary Libraries 

In [1]:
import pandas as pd
import numpy as np
import random

## Let's create a __Randomly Generated Dataset__ to process, so we can time each method on it

In [2]:
df = pd.DataFrame()
for i in range(2):
    df[i]= random.sample(range(1,10000000),10000) # Simply taking 10000 rows to observe the difference
df.columns = ['a' , 'b']
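
Because `random.sample` draws new values on every run, the rows (and to a small degree the timings) differ between sessions. A minimal sketch of a repeatable variant is shown below. The seed value `42` and the name `seeded_df` are assumptions made for illustration and are not part of the original notebook.

```python
import random

import pandas as pd

random.seed(42)  # hypothetical seed; any fixed integer gives a repeatable dataset

n_rows = 10_000  # same row count as the notebook's dataset

# Build both columns in one step instead of adding them inside a loop
seeded_df = pd.DataFrame({
    "a": random.sample(range(1, 10_000_000), n_rows),
    "b": random.sample(range(1, 10_000_000), n_rows),
})

print(seeded_df.shape)
```

Re-running the cell with the same seed reproduces the same rows, which makes timings easier to compare across sessions.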

In [3]:
df.head()

|   | a | b |
|---|---|---|
| 0 | 4831958 | 7148937 |
| 1 | 3700384 | 9058607 |
| 2 | 4260736 | 655587 |
| 3 | 7857849 | 8903877 |
| 4 | 9633463 | 6442000 |


# __Let's Start__

# Standard Python Loop

In [4]:
%%timeit # We use the %%timeit magic function to observe how much time each method takes and find the best one

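# Start from an object column: recent pandas versions warn when strings are written into a float (NaN) column
df['standard'] = None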
for row in range(len(df)):
    if df.loc[row,'b']<df.loc[row,'a']:
        df.loc[row , 'standard'] = 'small'
    else:
        df.loc[row , 'standard'] = 'big'

11.8 s ± 376 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
df.head()

|   | a | b | standard |
|---|---|---|---|
| 0 | 4831958 | 7148937 | big |
| 1 | 3700384 | 9058607 | big |
| 2 | 4260736 | 655587 | small |
| 3 | 7857849 | 8903877 | big |
| 4 | 9633463 | 6442000 | small |


# iterrows() in Pandas

In [6]:
%%timeit
df['iterrows'] = np.nan
list_1 = []
for i,row in df.iterrows():
    if row['b']<row['a']:
        list_1.append('small')
    else:
        list_1.append('big')

df['iterrows'] = list_1

1.58 s ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
df.head()

|   | a | b | standard | iterrows |
|---|---|---|---|---|
| 0 | 4831958 | 7148937 | big | big |
| 1 | 3700384 | 9058607 | big | big |
| 2 | 4260736 | 655587 | small | small |
| 3 | 7857849 | 8903877 | big | big |
| 4 | 9633463 | 6442000 | small | small |


# Apply Method in Pandas

In [8]:
%%timeit

df['apply'] = np.nan
df['apply'] = df.apply(lambda row :'small' if row['b']<row['a'] else 'big',axis=1)

275 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
df.head()

|   | a | b | standard | iterrows | apply |
|---|---|---|---|---|---|
| 0 | 4831958 | 7148937 | big | big | big |
| 1 | 3700384 | 9058607 | big | big | big |
| 2 | 4260736 | 655587 | small | small | small |
| 3 | 7857849 | 8903877 | big | big | big |
| 4 | 9633463 | 6442000 | small | small | small |


# Pandas Vectorization

In [10]:
%%timeit

df['p_vec'] = 'big'
df.loc[df['b']<df['a'],'p_vec'] = 'small'

4.21 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
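
Boolean-mask assignment with `loc` is one vectorized formulation. `np.where` is another common one that builds the whole column in a single expression. The sketch below is not part of the timed comparison in this notebook, and `demo` is a hypothetical three-row frame used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the notebook's `df`
demo = pd.DataFrame({"a": [5, 2, 9], "b": [7, 1, 9]})

# The comparison runs once over the whole column; np.where then picks
# 'small' where it is True and 'big' everywhere else (ties count as 'big')
demo["np_where"] = np.where(demo["b"] < demo["a"], "small", "big")

print(demo)
```

As with the `loc` version, no Python-level loop runs over the rows, so it should land in the same speed class. Its exact timing was not measured here.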


In [11]:
df.head()

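|   | a | b | standard | iterrows | apply | p_vec |
|---|---|---|---|---|---|---|
| 0 | 4831958 | 7148937 | big | big | big | big |
| 1 | 3700384 | 9058607 | big | big | big | big |
| 2 | 4260736 | 655587 | small | small | small | small |
| 3 | 7857849 | 8903877 | big | big | big | big |
| 4 | 9633463 | 6442000 | small | small | small | small |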


# __Observations__

- Standard Python Loop Time  : __11.8 seconds__
- Iterrows() Function Time   : __1.58 seconds__
- Pandas *Apply* Method Time : __275  milliseconds__
- Pandas Vectorization Time  : __4.21 milliseconds__
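
The relative speedups implied by these timings can be worked out directly. The snippet below only does arithmetic on the four mean values reported above; the dictionary keys are labels chosen for illustration.

```python
# Mean timings reported above, converted to seconds
timings = {
    "standard loop": 11.8,
    "iterrows": 1.58,
    "apply": 275e-3,
    "vectorized loc": 4.21e-3,
}

baseline = timings["standard loop"]

# Speedup of each method relative to the standard Python loop
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name:>15}: {factor:8.1f}x")
```

This works out to roughly 7.5x for `iterrows()`, 43x for `apply()` and 2,800x for the vectorized version.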

## We observe that Pandas vectorization with __*loc*__ can substantially reduce the processing time of a Pandas DataFrame.

__Compared with the standard Python loop, the runtime drops by a factor of roughly 2,800!__
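
To repeat the comparison outside a notebook, where the `%%timeit` magic is not available, the standard-library `timeit` module can be used instead. The following is a minimal sketch under stated assumptions: the function names, the 500-row `frame`, the seed and the repeat count are all hypothetical choices made to keep the run short, so the absolute numbers will not match the ones reported above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df`: 500 rows, seeded for repeatability
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "a": rng.integers(1, 10_000_000, size=500),
    "b": rng.integers(1, 10_000_000, size=500),
})


def label_loop(df):
    """Row-by-row reads and writes through .loc, as in the standard loop."""
    out = df.copy()
    out["label"] = None  # object column, so string writes are allowed
    for row in range(len(out)):
        if out.loc[row, "b"] < out.loc[row, "a"]:
            out.loc[row, "label"] = "small"
        else:
            out.loc[row, "label"] = "big"
    return out["label"].tolist()


def label_iterrows(df):
    """Collect labels in a list while iterating with iterrows()."""
    return ["small" if row["b"] < row["a"] else "big" for _, row in df.iterrows()]


def label_apply(df):
    """Row-wise apply with a lambda."""
    return df.apply(lambda row: "small" if row["b"] < row["a"] else "big", axis=1).tolist()


def label_vectorized(df):
    """Default to 'big', then overwrite matching rows with a boolean mask."""
    out = df.copy()
    out["label"] = "big"
    out.loc[out["b"] < out["a"], "label"] = "small"
    return out["label"].tolist()


methods = [label_loop, label_iterrows, label_apply, label_vectorized]

# Sanity check: every method should produce the same labels
results = [fn(frame) for fn in methods]
all_agree = all(r == results[0] for r in results)
print("all methods agree:", all_agree)

# Average seconds per call over two runs of each method
timings = {fn.__name__: timeit.timeit(lambda fn=fn: fn(frame), number=2) / 2 for fn in methods}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Checking that all four methods agree before timing them guards against comparing the speed of functions that do not compute the same thing.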