# Swifter

**Most important:**
Swifter works around the GIL (the Global Interpreter Lock, which lets only one thread execute Python code at any point in time) under the hood and uses all the cores.
https://realpython.com/python-gil/


A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

https://github.com/jmcarpenter2/swifter

https://github.com/jmcarpenter2/swifter/blob/master/examples/swifter_apply_examples.ipynb

**Optimizing code:**
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. Vectorization is usually better than scalar operations. 


# Vectorization in Python 


In [None]:
import numpy as np
from time import time

n=9999999
x=np.random.rand(n)
y=np.random.rand(n)

In the following section, we add two NumPy arrays in two ways, once with an element-wise for loop and once with a single vectorized operation, and compare the running times.

In [None]:
start_time=time()
s1=[]

for k in range(n):
  s1.append(x[k]+y[k])

end_time=time()

t1=end_time-start_time

print("the running time for elementwise adding is %s seconds"%(t1))

In [None]:
start_time=time()
s2=x+y
end_time=time()
t2=end_time-start_time

print("the running time for vectorization adding is %s seconds"%(t2))

In [None]:
(s1==s2).all()

In [None]:
t1/t2
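A single `time()` measurement can be noisy. As a hedged alternative, the sketch below repeats each measurement with the standard-library `timeit` module and keeps the fastest run. The smaller array size `n_small` and the helper names `add_loop` and `add_vectorized` are my own choices for illustration, not part of the original notebook.

```python
import timeit

import numpy as np

n_small = 100_000  # assumed smaller size so repeated runs stay quick
rng = np.random.default_rng(0)
a = rng.random(n_small)
b = rng.random(n_small)


def add_loop(a, b):
    """Element-wise addition with an explicit Python loop (hypothetical helper)."""
    return [a[k] + b[k] for k in range(len(a))]


def add_vectorized(a, b):
    """The same addition as a single NumPy operation (hypothetical helper)."""
    return a + b


# Repeat each measurement and keep the fastest run to reduce noise.
t_loop = min(timeit.repeat(lambda: add_loop(a, b), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: add_vectorized(a, b), number=1, repeat=3))

# Both approaches should produce the same values.
same = bool(np.allclose(add_loop(a, b), add_vectorized(a, b)))
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.6f} s, same result: {same}")
```

The exact ratio depends on the machine, so treat the printed numbers as indicative only.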

# Swifter Demo

In [None]:
!pip install swifter

**Make sure you restart your runtime**

In [None]:
#Import the package
import pandas as pd
import swifter

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
#read the dataset
df = pd.read_csv('r_dataisbeautiful_posts.csv')

In [None]:
%time df['score_2_subs'] = df['score'].apply(lambda x: x/2 )

In [None]:
# Importing swifter integrates it with pandas, so pandas methods such as apply become available through the .swifter accessor
%time df['score_2_swift'] = df['score'].swifter.apply(lambda x: x/2 )

## Vectorized Function for Swifter

According to the documentation, Swifter can apply a function up to a hundred times faster than pandas can. However, this only holds when the function is written in vectorized form.
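To picture why the vectorized form matters, here is a simplified sketch of the general "try the whole frame first, otherwise go row by row" idea. It is not Swifter's actual implementation. The helper `apply_vectorized_or_rowwise`, the two example functions, and the toy data are all hypothetical and use only pandas and NumPy.

```python
import numpy as np
import pandas as pd


def apply_vectorized_or_rowwise(df, func):
    """Hypothetical helper: try func on the whole frame, else apply it per row."""
    try:
        result = func(df)
        # Accept the result only if it yields one value per row.
        if len(result) == len(df):
            return pd.Series(np.asarray(result), index=df.index), "vectorized"
    except (ValueError, TypeError):
        # A row-oriented function usually fails on a whole frame,
        # e.g. `if series == 0:` raises ValueError.
        pass
    return df.apply(func, axis=1), "row-wise"


def double_if_no_comments_row(x):
    """Row-oriented version: expects a single row."""
    if x["num_comments"] == 0:
        return x["score"] * 2
    return x["score"]


def double_if_no_comments_vec(x):
    """Vectorized version: works on whole columns at once."""
    return np.where(x["num_comments"] == 0, x["score"] * 2, x["score"])


toy = pd.DataFrame({"score": [10, 5, 8], "num_comments": [0, 3, 0]})

out_row, mode_row = apply_vectorized_or_rowwise(toy, double_if_no_comments_row)
out_vec, mode_vec = apply_vectorized_or_rowwise(toy, double_if_no_comments_vec)
print(mode_row, out_row.tolist())
print(mode_vec, out_vec.tolist())
```

Only the second function can take the fast path, because it never branches on a single row's value.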

Let’s say I create a function that evaluates the `num_comments` and `score` variables. When the comment count is zero, I double the score; otherwise, the score stays the same. I then create a new column from the result.

In [None]:
# Something like this is also useful when doing feature engineering.

def scoring_comment(x):
    if x['num_comments'] == 0:
        return x['score'] *2
    else:
        return x['score']
        
# Apply the function with the regular pandas apply
%time df['score_comment'] = df[['score','num_comments']].apply(scoring_comment, axis =1)

In [None]:
# Swifter apply
%time df['score_comment_swift'] = df[['score', 'num_comments']].swifter.apply(scoring_comment, axis =1)

As the timings above show, Swifter’s apply is not much faster than the regular pandas apply here. This is because, for a non-vectorized function, Swifter falls back to Dask parallel processing instead of a vectorized path. So how does the performance change if we rewrite the function in vectorized form? Let’s try it.

In [None]:
import numpy as np

#Using np.where to implement vectorized function
def scoring_comment_vectorized(x):
    return np.where(x['num_comments'] ==0, x['score']*2, x['score'])
    
# Try the regular pandas apply (with axis=1, pandas still calls the function once per row)
%time df['score_comment_vectorized'] = df[['score', 'num_comments']].apply(scoring_comment_vectorized, axis =1)

In [None]:
# Swifter apply
%time df['score_comment_vectorized_swift'] = df[['score', 'num_comments']].swifter.apply(scoring_comment_vectorized, axis =1)
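For comparison, the same rule can also be written directly on whole columns, with no `apply` at all. The sketch below is a minimal illustration on a small made-up DataFrame (`toy` and the column name `score_comment_direct` are hypothetical, not from the dataset above), and it checks the result against the row-wise logic.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real dataset.
toy = pd.DataFrame({"score": [10, 5, 8], "num_comments": [0, 3, 0]})

# Column-wise rule: double the score where there are no comments.
toy["score_comment_direct"] = np.where(
    toy["num_comments"] == 0, toy["score"] * 2, toy["score"]
)


def scoring_comment_rowwise(row):
    """Row-wise reference with the same rule, used only for checking."""
    if row["num_comments"] == 0:
        return row["score"] * 2
    return row["score"]


reference = toy[["score", "num_comments"]].apply(scoring_comment_rowwise, axis=1)
matches = bool((reference == toy["score_comment_direct"]).all())
print(toy)
print("matches row-wise result:", matches)
```

On the full dataset the same expression would use `df` in place of `toy`; timing it with `%time` gives a baseline to compare against both apply variants.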

Good read: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06