<a href="https://colab.research.google.com/github/Ghonem22/Speed-up-your-dataframe-analysis-till-1500X/blob/main/Speed_up_your_dataframe_analysis_till_1500X_using_swifter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the required library
**Don't forget to restart the kernel after installation**

In [None]:
!pip install swifter

## Import the required libraries

In [5]:
import shutil
import swifter
import time

## Download and unzip data

In [5]:
!wget https://www.stats.govt.nz/assets/Uploads/New-Zealand-business-demography-statistics/New-Zealand-business-demography-statistics-At-February-2021/Download-data/Geographic-units-by-industry-and-statistical-area-2000-2021-descending-order-CSV.zip

--2022-04-13 02:46:51--  https://www.stats.govt.nz/assets/Uploads/New-Zealand-business-demography-statistics/New-Zealand-business-demography-statistics-At-February-2021/Download-data/Geographic-units-by-industry-and-statistical-area-2000-2021-descending-order-CSV.zip
Resolving www.stats.govt.nz (www.stats.govt.nz)... 45.60.11.104
Connecting to www.stats.govt.nz (www.stats.govt.nz)|45.60.11.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22898963 (22M) [application/zip]
Saving to: ‘Geographic-units-by-industry-and-statistical-area-2000-2021-descending-order-CSV.zip’


2022-04-13 02:46:55 (7.84 MB/s) - ‘Geographic-units-by-industry-and-statistical-area-2000-2021-descending-order-CSV.zip’ saved [22898963/22898963]



In [8]:
shutil.unpack_archive("/content/Geographic-units-by-industry-and-statistical-area-2000-2021-descending-order-CSV.zip", "/content")

In [3]:
import pandas as pd
df = pd.read_csv("/content/Data7602DescendingYearOrder.csv")
print(f"this dataframe has {len(df)} rows \n\n\n")
df.head()

this dataframe has 5704247 rows 





|   | anzsic06 | Area | year | geo_count | ec_count |
|---|---|---|---|---|---|
| 0 | A | A100100 | 2021 | 90 | 200 |
| 1 | A | A100200 | 2021 | 141 | 210 |
| 2 | A | A100300 | 2021 | 6 | 25 |
| 3 | A | A100400 | 2021 | 54 | 65 |
| 4 | A | A100500 | 2021 | 63 | 95 |


## Using apply with a simple arithmetic task

In [6]:
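# Note: this name shadows Python's built-in sum() for the rest of the notebook.
def sum(x, y):
    # Weighted combination of the two counts: 15 * x + y squared
    return x * 15 + y ** 2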

In [7]:
t = time.time()
df['new'] = df.apply(lambda x: sum(x['geo_count'], x['ec_count']), axis=1)
time.time() - t

86.58760833740234

In [8]:
del df['new']

In [9]:
t = time.time()
df['new'] = df.apply(lambda x: sum(x.geo_count, x.ec_count) , axis=1)
time.time() - t

124.95700478553772

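**In these runs, accessing columns as "x['geo_count']" was somewhat faster than accessing them as "x.geo_count" (about 87 s versus 125 s). Each variant was timed only once, so treat the gap as indicative rather than exact.**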
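To check the access-style difference in isolation, the two lookups can be timed on a single row with `timeit`. This is a minimal sketch, assuming a hand-built row that mirrors the first row of the dataframe; the iteration count is arbitrary and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hand-built stand-in for one row of the dataframe (values from df.head())
row = pd.Series({"geo_count": 90, "ec_count": 200})

n = 20_000  # arbitrary repeat count, chosen only to keep the run short
bracket_seconds = timeit.timeit(lambda: row["geo_count"], number=n)
attribute_seconds = timeit.timeit(lambda: row.geo_count, number=n)

print(f"row['geo_count']: {bracket_seconds:.4f} s for {n} lookups")
print(f"row.geo_count:    {attribute_seconds:.4f} s for {n} lookups")
```

Both styles return the same value, so any difference measured here comes only from how pandas resolves the lookup.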

## Let's test our tool: **swifter**

In [12]:
t = time.time()
df['new'] = df.swifter.apply(lambda x: sum(x['geo_count'], x['ec_count']), axis=1)
time.time() - t

0.0893561840057373

In [23]:
t = time.time()
df['new'] = df.swifter.apply(lambda x: sum(x.geo_count, x.ec_count), axis=1)
time.time() - t

0.09285783767700195

In [16]:
86.58760833740234 / 0.0953836441040039

907.7825569653158

In [15]:
124.95700478553772 / 0.0893561840057373


1398.414739572077

**As we can see, just adding swifter made the code roughly 900x to 1,400x faster in these runs.**
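The ratios above come from single `time.time()` measurements. A slightly more robust way to take such timings is `time.perf_counter()`, wrapped in a small helper. The sketch below is an assumption-laden illustration: `timed` is a hypothetical helper name that is not part of the notebook, and it is demonstrated on `sorted` rather than on the dataframe.

```python
import time


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Demonstration on a built-in; any callable works the same way
result, seconds = timed(sorted, [3, 1, 2])
print(result, f"{seconds:.6f} s")
```

In the notebook, the same helper could wrap a call such as `timed(df.apply, func, axis=1)`, where `func` stands for the row function being measured.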

### Why is **swifter** so much faster?

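**swifter first tries to run the function in a vectorized way on whole columns. Only if that fails does it fall back to parallel processing (via Dask) or a plain pandas apply, whichever it estimates to be faster. The swifter timings here are close to the hand-vectorized timing shown later, so the vectorized path is the likely source of this speedup; parallelization matters most for functions that cannot be vectorized.**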
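swifter's documentation describes it as trying a vectorized call before it resorts to parallel processing. The sketch below shows why that path is available for this particular function: called on the whole dataframe, it gives the same values as a row-by-row apply. It uses a small stand-in dataframe built from the first five rows, and `score` is a hypothetical name for the notebook's formula; swifter itself is not called.

```python
import pandas as pd

# Small stand-in for the notebook's dataframe (values from df.head())
sample = pd.DataFrame({
    "geo_count": [90, 141, 6, 54, 63],
    "ec_count": [200, 210, 25, 65, 95],
})


def score(x):
    # Same formula as the notebook; x may be one row or the whole dataframe
    return x["geo_count"] * 15 + x["ec_count"] ** 2


row_wise = sample.apply(score, axis=1)  # one Python-level call per row
whole_frame = score(sample)             # one call operating on whole columns

print((row_wise == whole_frame).all())
```

Because the function uses only column lookups and arithmetic, it works unchanged on full columns, which is the case a vectorizing tool can exploit.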

## Can we make the code even faster?



**The answer is vectorization:**
* Try to vectorize your computation first; this is the optimal solution.
* When that is hard to do, use swifter to speed up your code by taking advantage of parallelization.

In [21]:
t = time.time()
df['new2'] = df['geo_count'] * 15 + df['ec_count'] ** 2
time.time() - t

0.07243514060974121
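Row logic that contains a branch can often still be vectorized, for example with `numpy.where`. The sketch below assumes a hypothetical rule that is not in the original notebook (apply the formula only when `geo_count` is at least 50, otherwise use 0) and runs on a small stand-in dataframe.

```python
import numpy as np
import pandas as pd

# Small stand-in for the notebook's dataframe (values from df.head())
sample = pd.DataFrame({
    "geo_count": [90, 141, 6, 54, 63],
    "ec_count": [200, 210, 25, 65, 95],
})


def branchy(row):
    # Hypothetical rule for illustration only
    if row["geo_count"] >= 50:
        return row["geo_count"] * 15 + row["ec_count"] ** 2
    return 0


row_wise = sample.apply(branchy, axis=1)

# Vectorized equivalent: the condition and both outcomes are whole columns
vectorized = np.where(
    sample["geo_count"] >= 50,
    sample["geo_count"] * 15 + sample["ec_count"] ** 2,
    0,
)

print((row_wise.to_numpy() == vectorized).all())
```

`np.where` evaluates both outcomes for every row and then selects between them, so it fits cases where computing the unused branch is cheap and safe.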

In [24]:
df['new'] == df['new2']

0          True
1          True
2          True
3          True
4          True
           ... 
5704242    True
5704243    True
5704244    True
5704245    True
5704246    True
Length: 5704247, dtype: bool
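The element-wise comparison prints a truncated Series, so a mismatch in the hidden rows would go unnoticed. A minimal sketch of collapsing the comparison into a single boolean, using two short stand-in Series in place of `df['new']` and `df['new2']`:

```python
import pandas as pd

# Stand-ins for df['new'] and df['new2'] (values computed from df.head())
new = pd.Series([41350, 46215, 400, 5035, 9970])
new2 = pd.Series([41350, 46215, 400, 5035, 9970])

# True only if every element matches
all_match = bool((new == new2).all())

# Count mismatching rows, which is useful when the check fails
mismatches = int((new != new2).sum())

print(all_match, mismatches)
```

In the notebook the same check would read `(df['new'] == df['new2']).all()`.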

## Resources

1. https://towardsdatascience.com/speed-up-your-pandas-processing-with-swifter-6aa314600a13

2. https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66