# Daily-Dose-of-Data-Science

[Daily Dose of Data Science](https://avichawla.substack.com) is a publication on Substack that brings together intriguing frameworks, libraries, technologies, and tips that make the life cycle of a Data Science project effortless. 

Author: Avi Chawla

[Medium](https://medium.com/@avi_chawla) | [LinkedIn](https://www.linkedin.com/in/avi-chawla/)

# 70x Faster Pandas By Changing Just One Line of Code

Post Link: [Substack](https://avichawla.substack.com/p/70x-faster-pandas-by-changing-just)

LinkedIn Post: [LinkedIn](https://www.linkedin.com/feed/update/urn:li:activity:7018900729104904193/)

In [1]:
import os
os.environ["MODIN_ENGINE"] = "ray"

In [2]:
import modin.pandas as pd
import pandas

In [3]:
import time
import ray
ray.init()

2023-01-11 17:43:20,212	INFO worker.py:1518 -- Started a local Ray instance.


Python version: 3.9.12
Ray version: 2.0.0


### Get Dataset

In [6]:
file = "../_Extras/taxi.csv"

In [4]:
import urllib.request
s3_path = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"
urllib.request.urlretrieve(s3_path, file)  # 200 MB dataset

('../_Extras/taxi.csv', <http.client.HTTPMessage at 0x7fb54194bd60>)
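
Because the download is large, it can help to skip it when the file is already on disk. The following is a minimal sketch of that idea, not part of the original notebook: `fetch_if_missing` is a hypothetical helper name, and it only wraps the same `urllib.request.urlretrieve` call used above.

```python
import os
import urllib.request


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration; the notebook calls
    urllib.request.urlretrieve directly.
    """
    if not os.path.exists(dest):
        # Create the parent directory (e.g. "../_Extras") if it is missing.
        parent = os.path.dirname(dest)
        if parent:
            os.makedirs(parent, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest
```

With this in place, the download cell could be written as `fetch_if_missing(s3_path, file)`, and re-running the notebook would reuse the existing CSV.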

### Read CSV

In [7]:
start = time.time()

pandas_df = pandas.read_csv(file, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)  # quoting=3 is csv.QUOTE_NONE

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to read with pandas: 3.781 seconds


In [9]:
start = time.time()

modin_df = pd.read_csv(file, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)  # quoting=3 is csv.QUOTE_NONE

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

print("Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))

Time to read with Modin: 1.563 seconds
Modin is 2.42x faster than pandas at `read_csv`!


### Concat

In [16]:
start = time.time()

big_pandas_df = pandas.concat([pandas_df for _ in range(20)])

end = time.time()
pandas_duration = end - start
print("Time to concat with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to concat with pandas: 8.603 seconds


In [18]:
start = time.time()

big_modin_df = pd.concat([modin_df for _ in range(20)])

end = time.time()
modin_duration = end - start
print("Time to concat with Modin: {} seconds".format(round(modin_duration, 3)))

print("Modin is {}x faster than pandas at `concat`!".format(round(pandas_duration / modin_duration, 2)))

Time to concat with Modin: 0.121 seconds
Modin is 71.07x faster than pandas at `concat`!
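
The cells above repeat the same start/stop timing pattern and the same ratio, `pandas_duration / modin_duration`. A minimal sketch of how that pattern could be factored out is shown here; `timed` and `speedup` are hypothetical helper names, not part of the original notebook, and `time.perf_counter` is swapped in for `time.time` because it is the clock intended for measuring short intervals.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return round(baseline_seconds / candidate_seconds, 2)


# Rounded read_csv durations printed earlier: 3.781 s (pandas), 1.563 s (Modin).
print(speedup(3.781, 1.563))  # 2.42
```

Under these assumptions, a comparison cell reduces to something like `pandas_df, pandas_duration = timed(pandas.read_csv, file, quoting=3)` followed by the equivalent Modin call.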


In [23]:
!rm "../_Extras/taxi.csv"