# Daily-Dose-of-Data-Science

[Daily Dose of Data Science](https://avichawla.substack.com) is a Substack publication that collects frameworks, libraries, technologies, and tips for making the life cycle of a data science project easier.

Author: Avi Chawla

[Medium](https://medium.com/@avi_chawla) | [LinkedIn](https://www.linkedin.com/in/avi-chawla/)

# Pandas vs Polars — Run-time and Memory Comparison

Post Link: [Substack](https://www.blog.dailydoseofds.com/p/pandas-vs-polars-run-time-and-memory)

LinkedIn Post: [LinkedIn](https://www.linkedin.com/feed/update/urn:li:share:7073950186053484544/)

Twitter Post: [Twitter](https://twitter.com/_avichawla/status/1668184509500489731)

In [1]:
!pip install polars



In [2]:
import polars as pl
import pandas as pd

Download the dataset from here: [Dataset](https://drive.google.com/file/d/1sSugZwVWCSsep0-4Xi7peXCd5dvglm4a/view?usp=sharing)

## Read CSV

In [2]:
%timeit pd.read_csv("dataset.csv")

4.52 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]:
%timeit pl.read_csv("dataset.csv")

723 ms ± 182 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
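
The `%timeit` magic used above only works inside IPython/Jupyter. As a rough sketch of how the same "mean ± std. dev. of 7 runs" summary could be reproduced in a plain Python script, the standard-library `timeit` module can be used. The helper name `time_call` and the stand-in workload are hypothetical, and the use of the population standard deviation is an assumption about how `%timeit` summarizes its runs.

```python
import statistics
import timeit


def time_call(fn, repeat=7, number=1):
    """Return (mean, std. dev.) of the per-loop run time of fn(), in seconds."""
    # timeit.repeat returns one total time (in seconds) per run of `number` loops.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    # Population std. dev. is an assumption about what %timeit reports.
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Stand-in workload (hypothetical); swap in e.g. lambda: pd.read_csv("dataset.csv").
mean_s, std_s = time_call(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop (7 runs, 1 loop each)")
```

With `pandas` and `polars` imported, passing `lambda: pl.read_csv("dataset.csv")` instead of the stand-in should give numbers comparable in form to the outputs above.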


In [4]:
df_pd = pd.read_csv("dataset.csv")
df_pl = pl.read_csv("dataset.csv")

## To CSV

In [6]:
%timeit df_pd.to_csv("dataset_dummy_pandas.csv")

15.3 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
%timeit df_pl.write_csv("dataset_dummy_polars.csv")

3.26 s ± 96.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Memory Usage

In [7]:
df_pl.estimated_size() # in Bytes

633987120

In [8]:
df_pd.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 9 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Name                object 
 1   Company_Name        object 
 2   Employee_Job_Title  object 
 3   Employee_City       object 
 4   Employee_Country    object 
 5   Employee_Salary     int64  
 6   Employment_Status   object 
 7   Employee_Rating     float64
 8   Credits             int64  
dtypes: float64(1), int64(2), object(6)
memory usage: 1.7 GB
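
The two outputs above are in different units: Polars reports bytes, while pandas prints a rounded, human-readable figure. The sketch below puts both on the same scale. It assumes that the "GB" printed by `DataFrame.info` is computed in steps of 1024 (that is, GiB), and because the pandas figure is rounded to one decimal, the resulting ratio is only approximate.

```python
POLARS_BYTES = 633_987_120  # df_pl.estimated_size() output above
PANDAS_GIB = 1.7            # df_pd.info(memory_usage="deep") output above (rounded)


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


polars_gib = to_gib(POLARS_BYTES)
ratio = PANDAS_GIB / polars_gib
print(f"Polars: {polars_gib:.2f} GiB, pandas: {PANDAS_GIB} GiB, ratio: {ratio:.1f}x")
# prints: Polars: 0.59 GiB, pandas: 1.7 GiB, ratio: 2.9x
```

Under these assumptions, the pandas DataFrame takes roughly three times the memory of the Polars one for this dataset.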


## Selecting Columns

In [9]:
%timeit df_pd[["Name", "Employee_Rating"]]

29.2 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [10]:
%timeit df_pl[["Name", "Employee_Rating"]]

2.22 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


## Filtering

In [11]:
%timeit df_pd[df_pd.Credits>2]

130 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [13]:
%timeit df_pl.filter(pl.col('Credits') > 2)

93.5 ms ± 7.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Grouping

In [14]:
%timeit df_pd.groupby("Company_Name").Employee_Salary.mean().reset_index()

219 ms ± 8.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%timeit df_pl.group_by("Company_Name").agg(pl.col("Employee_Salary").mean())

47.5 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Sorting

In [16]:
%timeit df_pd.sort_values("Employee_Salary")

884 ms ± 55.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%timeit df_pl.sort("Employee_Salary")

932 ms ± 287 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
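
To make the comparison easier to read at a glance, the mean run times reported above can be turned into speedup ratios (pandas time divided by Polars time). This is only a small sketch over the numbers already shown in this notebook; it ignores the standard deviations, so ratios close to 1 (sorting in particular, where the two ranges overlap) should not be read as a real difference.

```python
# (pandas seconds, Polars seconds), copied from the %timeit means above.
timings = {
    "read_csv": (4.52, 723e-3),
    "to_csv": (15.3, 3.26),
    "select columns": (29.2e-3, 2.22e-6),
    "filter": (130e-3, 93.5e-3),
    "group by": (219e-3, 47.5e-3),
    "sort": (884e-3, 932e-3),
}

# A ratio above 1 means Polars was faster; below 1 means pandas was faster.
speedups = {op: pandas_s / polars_s for op, (pandas_s, polars_s) in timings.items()}

for op, ratio in speedups.items():
    print(f"{op:<15} {ratio:>10.2f}x")
```

On these measurements Polars is faster for every operation except sorting, where pandas has a slightly lower mean.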
