## Modin

Modin uses Ray or Dask to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.

You do not need to know the available hardware resources in advance to use Modin, nor do you need to specify how data is distributed or placed. Modin acts as a drop-in replacement for pandas: existing pandas notebooks can run unchanged apart from the import statement, and many operations, especially on larger datasets, run faster even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.
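
The import swap can also be made conditional, so that the same notebook runs whether or not Modin is installed. The following is a minimal sketch and not part of Modin itself: the helper names `pick_dataframe_module` and `load_dataframe_library` are hypothetical, and the availability check relies only on the standard library's `importlib`.

```python
import importlib
import importlib.util


def pick_dataframe_module(is_available=None):
    """Return "modin.pandas" if Modin is importable, otherwise "pandas"."""
    if is_available is None:
        # find_spec returns None when a top-level package is not installed.
        is_available = lambda name: importlib.util.find_spec(name) is not None
    return "modin.pandas" if is_available("modin") else "pandas"


def load_dataframe_library():
    """Import and return whichever DataFrame library was selected."""
    return importlib.import_module(pick_dataframe_module())
```

Calling `pd = load_dataframe_library()` at the top of a notebook then leaves the rest of the code, such as `pd.read_csv(...)`, unchanged.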

In [1]:
# Installation
!pip install modin

Collecting modin
  Downloading modin-0.16.2-py3-none-any.whl (957 kB)
Collecting pandas==1.5.1
  Downloading pandas-1.5.1-cp39-cp39-win_amd64.whl (10.9 MB)
Installing collected packages: pandas, modin
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.4
    Uninstalling pandas-1.3.4:
      Successfully uninstalled pandas-1.3.4
Successfully installed modin-0.16.2 pandas-1.5.1




In [4]:
# pip install "modin[ray]"   # Install Modin dependencies and Ray to run on Ray
# pip install "modin[dask]"  # Install Modin dependencies and Dask to run on Dask
# Install all of the above. The comment sits on its own line because a trailing
# "#" after the !pip command is passed to pip and rejected as an invalid requirement.
!pip install "modin[all]"


Modin will automatically detect which engine you have installed and use that for scheduling computation!

If you want to choose a specific compute engine to run on, you can set the environment variable MODIN_ENGINE and Modin will do computation with that engine:

In [None]:
# export MODIN_ENGINE=ray  # Modin will use Ray
# export MODIN_ENGINE=dask  # Modin will use Dask
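
The `export` lines above are shell commands. Inside a notebook the same variable can be set from Python, ideally before Modin is first imported. The following is a minimal sketch that assumes only the two engines named in this section; `KNOWN_ENGINES`, `installed_engines`, and `select_engine` are hypothetical names, and an importable package is not a guarantee that the engine is fully configured.

```python
import importlib.util
import os

# Engines mentioned in this section; treat this set as an assumption.
KNOWN_ENGINES = ("ray", "dask")


def installed_engines():
    """Return the known engines whose packages are importable here."""
    return [name for name in KNOWN_ENGINES
            if importlib.util.find_spec(name) is not None]


def select_engine(name):
    """Set MODIN_ENGINE for this process after validating the name."""
    engine = name.strip().lower()
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown engine {name!r}; expected one of {KNOWN_ENGINES}")
    os.environ["MODIN_ENGINE"] = engine
    return engine
```

A notebook could call `select_engine("ray")` in its first cell and import `modin.pandas` afterwards.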

In [9]:
# import os

# os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
# os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask
import time
import modin.pandas as mpd  # modin pandas
import pandas as pd  # normal pandas

In [50]:
%%time

print("Time taken by normal pandas")
df = pd.read_csv("gun-violence-data_01-2013_03-2018.csv")

Time taken by normal pandas
Wall time: 2.62 s


In [51]:
%%time

print("Time taken by modin pandas")
df1 = mpd.read_csv("gun-violence-data_01-2013_03-2018.csv")

Time taken by modin pandas
Wall time: 1.34 s


In [52]:
# Another approach to measure the time

In [60]:
start = time.time()

df = pd.read_csv("gun-violence-data_01-2013_03-2018.csv")

end = time.time()
pandas_duration = end - start
print("pandas_duration:- ", pandas_duration)

pandas_duration:-  2.592864513397217


In [61]:
start = time.time()

df1 = mpd.read_csv("gun-violence-data_01-2013_03-2018.csv")

end = time.time()
modin_duration = end - start
print("modin_duration:- ", modin_duration)

modin_duration:-  1.1691920757293701


In [62]:
# Time difference
time_diff = pandas_duration - modin_duration
print("time_diff:- ", time_diff)

time_diff:-  1.4236724376678467
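
An absolute difference in seconds depends on the file size, so a ratio is often easier to compare across runs. The sketch below reuses the two durations printed above and wraps the start/end pattern in a small context manager. `Timer` and `speedup` are hypothetical helper names, and `time.perf_counter` is used instead of `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time


class Timer:
    """Context manager that records elapsed wall-clock seconds."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self.start
        return False  # do not suppress exceptions raised inside the block


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Durations copied from the outputs above.
pandas_duration = 2.592864513397217
modin_duration = 1.1691920757293701
print(f"speedup: {speedup(pandas_duration, modin_duration):.2f}x")
```

For this run the ratio comes out at roughly 2.2, meaning the Modin read took a little under half the pandas time. In the notebook, a read could be timed with `with Timer() as t:` around the `read_csv` call and inspected through `t.elapsed`.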


In [40]:
df.head(2)

Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,incident_url,source_url,incident_url_fields_missing,...,participant_age,participant_age_group,participant_gender,participant_name,participant_relationship,participant_status,participant_type,sources,state_house_district,state_senate_district
0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,http://www.gunviolencearchive.org/incident/461105,http://www.post-gazette.com/local/south/2013/0...,False,...,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Male||1::Male||3::Male||4::Female,0::Julian Sims,,0::Arrested||1::Injured||2::Injured||3::Injure...,0::Victim||1::Victim||2::Victim||3::Victim||4:...,http://pittsburgh.cbslocal.com/2013/01/01/4-pe...,,
1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,http://www.gunviolencearchive.org/incident/460726,http://www.dailybulletin.com/article/zz/201301...,False,...,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Male,0::Bernard Gillis,,0::Killed||1::Injured||2::Injured||3::Injured,0::Victim||1::Victim||2::Victim||3::Victim||4:...,http://losangeles.cbslocal.com/2013/01/01/man-...,62.0,35.0


### Practice

In [20]:
# Apply a function to a column

def multiply(value):
    # Double a single cell value
    return value * 2

In [26]:
%%time
print("Time taken by normal pandas")
df['state_house_district'] = df['state_house_district'].apply(multiply)

Time taken by normal pandas
Wall time: 58 ms


In [30]:
%%time
print("Time taken by modin pandas")
df1['state_house_district'] = df1['state_house_district'].apply(multiply)

Time taken by modin pandas
Wall time: 153 ms


In [42]:
df2 = df  # df2 is a second reference to the same pandas DataFrame, not a copy

In [48]:
%%time
print("Time taken by normal pandas")
df3=pd.concat([df,df2])

Time taken by normal pandas
Wall time: 199 ms


In [49]:
%%time
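print("Time taken by modin pandas")
# Note: df and df2 are pandas DataFrames here, so this timing likely includes
# converting them to Modin objects; concatenating df1 would measure Modin alone.
df3 = mpd.concat([df, df2])

Time taken by modin pandas
Wall time: 2.69 s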
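
Single-run timings such as the ones in this section can be dominated by one-off costs and noise, which matters most for operations that take only milliseconds. One common remedy is to repeat the measurement and report the fastest run. The following is a minimal sketch using the standard library's `timeit`; `best_of` is a hypothetical helper, and the workload shown is a stand-in rather than the DataFrame operations above.

```python
import timeit


def best_of(func, repeats=5):
    """Run func `repeats` times and return the fastest duration in seconds."""
    # number=1 times a single call per repetition.
    durations = timeit.repeat(func, number=1, repeat=repeats)
    return min(durations)


# Stand-in workload; in the notebook this could be a zero-argument lambda
# wrapping a call such as pd.concat([df, df2]).
fastest = best_of(lambda: sum(range(100_000)))
print(f"fastest of 5 runs: {fastest:.6f} s")
```

Taking the minimum rather than the mean is a design choice: it estimates the cost of the operation with the least interference from other processes, at the price of hiding variability.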
