In [23]:
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

<h1>There are 3 dtype backends</h1>
<h3>numpy (the default when no dtype_backend is requested)</h3>
<ul>
    <li><b>Integer</b> columns with <b>missing data</b> are upcast to float64, with missing values represented as np.nan (Not a Number).
        This is a significant drawback: it can be memory-inefficient, and it changes the data type <b>unexpectedly</b></li>
    <li><b>String</b> columns, with or without missing data, are stored as <b>dtype('O')</b>, a generic Python object type, which is <b>less performant</b> than dedicated string types</li>
</ul>
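<p>A minimal sketch of the upcast described above, assuming default pandas 2.x settings. The series names are illustrative only.</p>

```python
import numpy as np
import pandas as pd

# NumPy has no integer NA, so a missing value forces the column to float64.
ints = pd.Series([1, 2, None])
print(ints.dtype)     # float64
print(ints.tolist())  # [1.0, 2.0, nan]

# The same upcast happens when a gap is introduced later, e.g. by reindexing.
complete = pd.Series([1, 2, 3])
print(complete.dtype)                     # int64
print(complete.reindex([0, 1, 5]).dtype)  # float64

# Strings fall back to the generic object dtype under pandas 2.x defaults
# (assumption: no string-inference option has been enabled).
strings = pd.Series(["a", None])
print(strings.dtype)
```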

<h3>numpy_nullable</h3>
<ul>
    <li>The numpy_nullable backend uses pandas' specialized extension dtypes to support null values without changing the underlying data type</li>
    <li>Integer columns with missing values remain as integer-based types (Int64, Int32, etc.)</li>
    <li>Dtypes like StringDtype and BooleanDtype provide a more explicit and efficient representation than the generic object dtype</li>
</ul>
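<p>The points above can be sketched as follows. This is an illustrative example with made-up values, not part of the original notebook run.</p>

```python
import pandas as pd

# Nullable extension dtypes keep their logical type and mark gaps with pd.NA.
ints = pd.Series([1, 2, None], dtype="Int64")
flags = pd.Series([True, None], dtype="boolean")
strings = pd.Series(["a", None], dtype="string")

print(ints.dtype)            # Int64
print(ints.isna().tolist())  # [False, False, True]
print(flags.dtype)           # boolean

# convert_dtypes can recover an integer dtype from a float column
# whose non-missing values are all whole numbers.
converted = pd.Series([1.0, 2.0, None]).convert_dtypes(dtype_backend="numpy_nullable")
print(converted.dtype)       # Int64
```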

<h3>pyarrow</h3>
<ul>
    <li><b>Performance</b>: Can provide faster I/O operations (reading and writing data), especially for string data</li>
    <li><b>Memory efficiency</b>: Arrow-backed data often uses less memory than NumPy-backed data, most noticeably for string columns stored as Python objects</li>
    <li><b>Native null handling</b>: Provides robust support for null values within all data types, including integers and booleans</li>
    <li><b>Interoperability</b>: Facilitates seamless data transfer to and from other systems
        that use the Apache Arrow format, such as distributed computing frameworks and databases</li>
</ul>

In [28]:
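def get_df():
    """Return a demo frame with a string column and a date column containing one missing value."""
    today = date.today()
    return pd.DataFrame({
        'name': ['yesterday', 'today', 'tomorrow'],
        # The first date is deliberately None to exercise missing-value handling
        'value': [None, today, today + relativedelta(days=1)]
    })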

<h1>Without conversion, the DataFrame uses the default numpy backend</h1>

In [26]:
df = get_df()
df.info()

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   value   2 non-null      object
dtypes: object(2)
memory usage: 180.0+ bytes


,name,value
0,yesterday,
1,today,2025-10-29
2,tomorrow,2025-10-30


<h1>Convert to numpy_nullable</h1>

In [27]:
df = get_df().convert_dtypes(dtype_backend='numpy_nullable')
df.info()

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      string
 1   value   2 non-null      object
dtypes: object(1), string(1)
memory usage: 180.0+ bytes


,name,value
0,yesterday,
1,today,2025-10-29
2,tomorrow,2025-10-30


<h1>Convert to pyarrow</h1>

In [30]:
df = get_df().convert_dtypes(dtype_backend='pyarrow')
df.info()

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   name    3 non-null      string[pyarrow]
 1   value   2 non-null      object         
dtypes: object(1), string[pyarrow](1)
memory usage: 190.0+ bytes


,name,value
0,yesterday,
1,today,2025-10-29
2,tomorrow,2025-10-30


<h1>It is not yet possible to set the dtype_backend globally</h1>
<h4>As of pandas 2.3, there is no option to set a global dtype_backend</h4>
<a href="https://pandas.pydata.org/docs/reference/api/pandas.get_option.html">https://pandas.pydata.org/docs/reference/api/pandas.get_option.html</a>
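
<p>Without a global option, the backend has to be passed per call. One possible workaround, sketched here under the assumption that a project-level constant is acceptable, is to bind the keyword once. The names <code>DTYPE_BACKEND</code> and <code>read_csv</code> below are hypothetical helpers, not pandas settings.</p>

```python
import io
from functools import partial

import pandas as pd

# Hypothetical project-wide choice; pandas itself has no such setting.
DTYPE_BACKEND = "numpy_nullable"

# Hypothetical wrapper: pd.read_csv with the dtype_backend keyword pre-filled.
read_csv = partial(pd.read_csv, dtype_backend=DTYPE_BACKEND)

csv_text = "a,b\n1,x\n,y\n"
df = read_csv(io.StringIO(csv_text))
print(df.dtypes)  # column a stays an integer dtype despite the empty cell
```

<p>The same keyword is accepted by <code>convert_dtypes</code>, as used in the cells above, so an existing frame can be converted after it is built.</p>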