## 1 - Bacgroud
Pandas released a pre release version of pandas 2.0, it's now support Arrow as it's backend to storing data and etc.
Hence we want to have a comparsion like:

- Pandas 1.x vs Pandas 2.x
- Pandas 2.x vs Polars

In [1]:
import pandas as pd
import numpy as np
import polars as pl
import pyarrow as pa
import datetime

print(f"Pandas version is {pd.__version__}")
print(f"Numpy version is {np.__version__}")
print(f"Polars version is {pl.__version__}")

Pandas version is 2.0.0rc0
Numpy version is 1.24.2
Polars version is 0.16.12


## 2- Pandas Numpy Backend

By default it's still numpy array in Pandas 2.0

In [2]:
# loading csv data
df = pd.read_csv('train_peptides.csv')
df_polars = pl.read_csv('train_peptides.csv')

In [3]:
df.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981834 entries, 0 to 981833
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   visit_id          981834 non-null  object 
 1   visit_month       981834 non-null  int64  
 2   patient_id        981834 non-null  int64  
 3   UniProt           981834 non-null  object 
 4   Peptide           981834 non-null  object 
 5   PeptideAbundance  981834 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 44.9+ MB


In [5]:
# check the type
type(df['PeptideAbundance'].values)

numpy.ndarray

In [6]:
# create a new series -> still in numpy format
pd.Series([5, 6, 7, 9])

0    5
1    6
2    7
3    9
dtype: int64

In [7]:
# create a new string series -> still object type from numpy
pd.Series(['foo', 'bar', 'foobar'])

0       foo
1       bar
2    foobar
dtype: object

## 3 - Pandas Arrow Backend
Need to explicit tell api you need arrow backend

In [8]:
# now the type is pyarrow int64
pd.Series([5, 6, 7, 9], dtype='int64[pyarrow]')

0    5
1    6
2    7
3    9
dtype: int64[pyarrow]

In [9]:
# now the type is string instead of object
pd.Series(['foo', 'bar', 'foobar'], dtype='string[pyarrow]')

0       foo
1       bar
2    foobar
dtype: string

### Setting pandas to use arrow by default

In [10]:
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.copy_on_write = True

df_arrow = pd.read_csv('train_peptides.csv', use_nullable_dtypes=True)

In [11]:
df_arrow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981834 entries, 0 to 981833
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype          
---  ------            --------------   -----          
 0   visit_id          981834 non-null  string[pyarrow]
 1   visit_month       981834 non-null  int64[pyarrow] 
 2   patient_id        981834 non-null  int64[pyarrow] 
 3   UniProt           981834 non-null  string[pyarrow]
 4   Peptide           981834 non-null  string[pyarrow]
 5   PeptideAbundance  981834 non-null  double[pyarrow]
dtypes: double[pyarrow](1), int64[pyarrow](2), string[pyarrow](3)
memory usage: 63.9 MB


## 4 - Why arrow ?
- Handle missing value
- Faster (most operations)
- More reasonable data type

### 4-1 Handle missing values in a more make sense way

In [12]:
# for numpy backend if we have a missing value in a integer array
# automatically all the value will be conver to float and None will be convert to NaN which means not a number
# this is not the idea for many cases and also more memory and computation power needed for float
pd.Series([1, 2, 3, 4, None])

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

In [13]:
# for pyarrow integer remained and and NA value kept as well
# this is all based on arrow's native solution and make things effectivly as well
pd.Series([1, 2, 3, 4, None], dtype='int64[pyarrow]')

0       1
1       2
2       3
3       4
4    <NA>
dtype: int64[pyarrow]

### 4-2 Faster

- Simple operation mean, Pandas Arrow 2.4x faster than Pandas Numpy (Polars > Pandas-arrow > Pandas-numpy)
- Reading csv file, Pandas Arrow 16.1x faster than Pandas Numpy (Pandas-arrow > Polars > Pandas-numpy)
- String operation, Pandas Arrow 30x faster than Pandas Numpy (Pandas-arrow > Polars >> Pandas-numpy)
- Groupby/Agg operation, Polars 14.5x faster than Pandas Numpy, 79x faster than Pandas Arrow (Polars >> Pandas-numpy > Pandas-arrow)

#### 4-2-1 Mean operation

In [40]:
# Old numpy back end
%timeit df['PeptideAbundance'].mean()

537 µs ± 5.41 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [41]:
# Arrow backend
%timeit df_arrow['PeptideAbundance'].mean()

222 µs ± 8.01 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [42]:
# Poarls
%timeit df_arrow['PeptideAbundance'].mean()

211 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [43]:
537/222, 571/211

(2.418918918918919, 2.706161137440758)

#### 4-2-2 Reading file

In [18]:
%%timeit
# Reading in data use numpy
pd.read_csv('train_peptides.csv')

241 ms ± 5.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
# Reading in data use pyarrow
pd.read_csv('train_peptides.csv', engine='pyarrow', use_nullable_dtypes=True)

14.9 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%%timeit
# Reading in data use polars
pl.read_csv('train_peptides.csv')

27.1 ms ± 888 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [37]:
241/14.9, 241/27.1

(16.174496644295303, 8.892988929889299)

#### 4-2-3 String operation

In [20]:
%timeit df['UniProt'].str.startswith('N') # numpy backend

68.6 ms ± 849 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
%timeit df_arrow['UniProt'].str.startswith('N') # arrow backend

2.26 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [28]:
%timeit df_polars['UniProt'].str.starts_with('N') # polars

3.33 ms ± 45.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [44]:
68.6 /2.26, 68.6/3.33

(30.353982300884955, 20.600600600600597)

#### 4-2-4 Groupby/aggreation/transform operation

In [29]:
%timeit df.groupby(['visit_id'])['PeptideAbundance'].agg(['mean', 'sum', 'max']) # numpy backend

21.8 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [30]:
%timeit df_arrow.groupby(['visit_id'])['PeptideAbundance'].agg(['mean', 'sum', 'max']) # arrow backend

114 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [31]:
%%timeit
# polars agg functions
(
    df_polars.groupby(['visit_id'])
    .agg([
        pl.col('PeptideAbundance').mean().alias('PeptideAbundance_mean'),
        pl.col('PeptideAbundance').sum().alias('PeptideAbundance_sum'),
        pl.col('PeptideAbundance').max().alias('PeptideAbundance_max'),
    ])
)

1.44 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [32]:
%timeit df.groupby('visit_id')['UniProt'].transform(len)

85.1 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [33]:
%timeit df_arrow.groupby('visit_id')['UniProt'].transform(len)

82.4 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 4-3 More reasonable data type

In [29]:
# Take boolean for example -> numpy will use 8 bit to store boolean value
pd.Series([True, False, True, True, True, False]).info()

<class 'pandas.core.series.Series'>
RangeIndex: 6 entries, 0 to 5
Series name: None
Non-Null Count  Dtype
--------------  -----
6 non-null      bool 
dtypes: bool(1)
memory usage: 134.0 bytes


In [30]:
# howevery arrow backend only use 1 bit to store boolean value
pd.Series([True, False, True, True, True, False], dtype='bool[pyarrow]').info()

<class 'pandas.core.series.Series'>
RangeIndex: 6 entries, 0 to 5
Series name: None
Non-Null Count  Dtype        
--------------  -----        
6 non-null      bool[pyarrow]
dtypes: bool[pyarrow](1)
memory usage: 129.0 bytes


In [31]:
# Because numpy backend is not able to handle NA value properly will convert all the type to object
pd.Series([True, False, True, True, True, None])

0     True
1    False
2     True
3     True
4     True
5     None
dtype: object

In [32]:
# howevery arrow won't have this issue, it's still a boolean type
pd.Series([True, False, True, True, True, None], dtype='bool[pyarrow]')

0     True
1    False
2     True
3     True
4     True
5     <NA>
dtype: bool[pyarrow]

In [33]:
articles = pd.DataFrame({
    'title' : pd.Series(['pandas 2.0 and Arrow revolution', 'What I did this weekend'], dtype='string[pyarrow]'),
    #'tags' : pd.Series([['pandas', 'arrow', 'data'], ['scuba-diving', 'rock-climbing']], dtype=pd.ArrowDtype(pa.list_(pa.string()))), support customized data type
    'date' : pd.Series([datetime.date(2023, 2, 22), datetime.date(2022, 11, 3)], dtype='date32[pyarrow]')
})

articles

Unnamed: 0,title,date
0,pandas 2.0 and Arrow revolution,2023-02-22
1,What I did this weekend,2022-11-03


## Conclusion
- Data type: Pandas-arrow has a better support for all kinds data type, especailly for NA values and also support customized data type
- Speed: Most of operations Pandas-arrow is much faster than Pandas-numpy, slightly faster than Polars. But for groupby/aggregation/transform function Pandas-arrow is even slower than Pandas-numpy, I guess the split-preocess-merge is not implemented yet.
- Memory: Pandas-arrow is more memory efficient.

For now my recommendation is Polars > Pandas 2.x > Pandas 1.x, but if don't care groupby/agg/transform functions performance, you can choose Pandas 2.x as well, for large dataset I still recommend Polars as first choice.
