In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:

When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.

When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
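
The difference between the accessors can be seen in a small sketch. It assumes a nullable Int64 column as the extension type; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A Series backed by an extension type (nullable integers).
s = pd.Series([1, 2, None], dtype="Int64")

# .array always gives back the ExtensionArray that backs the Series.
ext = s.array
# .to_numpy() always gives back a NumPy array; here the missing value
# is converted to NaN, so the data has to be copied and coerced.
arr = s.to_numpy(dtype="float64", na_value=np.nan)

print(type(ext).__name__)
print(type(arr).__name__, arr.dtype)

# A DataFrame with mixed dtypes is coerced to one common dtype.
mixed = pd.DataFrame({"n": [1, 2], "label": ["a", "b"]})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)
```

With an integer column next to a string column, the only common dtype is object, which is why the conversion can be expensive on large frames.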

In [2]:
import pandas as pd
import numpy as np
# Create a DataFrame with 1 million rows and two columns
df = pd.DataFrame({
    'a': np.random.rand(1000000),
    'b': np.random.rand(1000000)
})

In [3]:
df.values

array([[0.65281902, 0.38596966],
       [0.81884361, 0.5905908 ],
       [0.73034918, 0.25829396],
       ...,
       [0.66651263, 0.12558555],
       [0.63100811, 0.71387107],
       [0.45298855, 0.91582466]], shape=(1000000, 2))

In [4]:
df.to_numpy()

array([[0.65281902, 0.38596966],
       [0.81884361, 0.5905908 ],
       [0.73034918, 0.25829396],
       ...,
       [0.66651263, 0.12558555],
       [0.63100811, 0.71387107],
       [0.45298855, 0.91582466]], shape=(1000000, 2))

In [5]:
import time
import math

In [6]:
# Create a DataFrame with 1 million rows and one column 'x'
df = pd.DataFrame({'x': np.linspace(0, 10, 1000000)})

# 1. Using vectorized np.sin function
start = time.time()
df['sin_vectorized'] = np.sin(df['x'])
time_vectorized = time.time() - start
print("Time using vectorized np.sin:", time_vectorized, "seconds")

# 2. Using apply with math.sin (processing element by element)
start = time.time()
df['sin_loop'] = df['x'].apply(math.sin)
time_loop = time.time() - start
print("Time using apply with math.sin:", time_loop, "seconds")

Time using vectorized np.sin: 0.029645919799804688 seconds
Time using apply with math.sin: 0.2880251407623291 seconds
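
A single time.time() difference can be noisy. One possible alternative, sketched here with a smaller Series so it runs quickly, is to take the best of several repeats with the standard-library timeit module and to check that both approaches agree. The sizes and names are assumptions, not part of the measurement above.

```python
import math
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 rows used above so the sketch runs quickly.
x = pd.Series(np.linspace(0, 10, 100_000))

# The best of several repeats is less sensitive to background noise
# than a single start/stop measurement.
t_vectorized = min(timeit.repeat(lambda: np.sin(x), number=1, repeat=5))
t_elementwise = min(timeit.repeat(lambda: x.apply(math.sin), number=1, repeat=5))

print(f"vectorized np.sin:   {t_vectorized:.6f} s")
print(f"apply with math.sin: {t_elementwise:.6f} s")

# Both approaches should produce the same values up to rounding.
same = np.allclose(np.sin(x), x.apply(math.sin))
print("results agree:", same)
```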


pandas can accelerate certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries.

These libraries are especially useful when dealing with large data sets, and can provide large speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that contain NaNs.

In [7]:
pd.set_option("compute.use_bottleneck", True)
pd.set_option("compute.use_numexpr", True)
# Create a DataFrame with 1 million rows and two columns, 'a' and 'b'
df2 = pd.DataFrame({
    'a': np.random.rand(1000000),
    'b': np.random.rand(1000000)
})

start = time.time()
df2['sum_loop'] = df2.apply(lambda row: row['a'] + row['b'], axis=1)
time_loop = time.time() - start
print("Time using apply with lambda:", time_loop, "seconds")

Time using apply with lambda: 10.989691734313965 seconds


In [8]:
pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)

# Create a DataFrame with 1 million rows and two columns, 'a' and 'b'
df2 = pd.DataFrame({
    'a': np.random.rand(1000000),
    'b': np.random.rand(1000000)
})

start = time.time()
df2['sum_loop'] = df2.apply(lambda row: row['a'] + row['b'], axis=1)
time_loop = time.time() - start
print("Time using apply with lambda:", time_loop, "seconds")

Time using apply with lambda: 12.076786041259766 seconds
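
The two timings are close, which is not surprising: a row-wise apply calls the Python lambda once per row, so it most likely never reaches the code paths these options control. As far as the pandas documentation describes it, numexpr is only considered for element-wise operations on large arrays, and only when the library is installed. A minimal sketch of a comparison that targets such an operation follows; the helper name time_vectorized_sum is hypothetical, and the measured difference will depend on the machine and on whether numexpr is available.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({"a": rng.random(1_000_000), "b": rng.random(1_000_000)})


def time_vectorized_sum(use_numexpr):
    """Time a whole-column addition with the numexpr option on or off."""
    # option_context restores the previous setting when the block exits.
    with pd.option_context("compute.use_numexpr", use_numexpr):
        start = time.perf_counter()
        total = frame["a"] + frame["b"]
        elapsed = time.perf_counter() - start
    return total, elapsed


with_ne, t_on = time_vectorized_sum(True)
without_ne, t_off = time_vectorized_sum(False)

print(f"numexpr option on:  {t_on:.6f} s")
print(f"numexpr option off: {t_off:.6f} s")
```

Either way, the vectorized addition avoids the per-row Python call that dominates the apply timings above.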


Flexible comparison

In [11]:
df.eq(df2)

            a      b  sin_loop  sin_vectorized  sum_loop      x
0       False  False     False           False     False  False
1       False  False     False           False     False  False
2       False  False     False           False     False  False
3       False  False     False           False     False  False
4       False  False     False           False     False  False
...       ...    ...       ...             ...       ...    ...
999995  False  False     False           False     False  False
999996  False  False     False           False     False  False
999997  False  False     False           False     False  False
999998  False  False     False           False     False  False
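
Every entry above is False because df and df2 have no column names in common: eq aligns the two frames on their labels before comparing, and a column that exists on only one side is compared against missing values. A small sketch with made-up frames (left and right are illustrative names) shows the same effect:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [1, 0], "c": [3, 4]})

# eq aligns on row and column labels first. A column missing on either
# side is filled with NaN, and NaN never compares equal to anything.
result = left.eq(right)
print(result)
```

Only the shared column "a" can contain True; "b" and "c" are False throughout.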


In [13]:
df

              x  sin_vectorized  sin_loop
0       0.00000        0.000000  0.000000
1       0.00001        0.000010  0.000010
2       0.00002        0.000020  0.000020
3       0.00003        0.000030  0.000030
4       0.00004        0.000040  0.000040
...         ...             ...       ...
999995  9.99996       -0.543988 -0.543988
999996  9.99997       -0.543996 -0.543996
999997  9.99998       -0.544004 -0.544004
999998  9.99999       -0.544013 -0.544013


In [14]:
(df > 0).all()

x                 False
sin_vectorized    False
sin_loop          False
dtype: bool

In [15]:
(df > 0).any()

x                 True
sin_vectorized    True
sin_loop          True
dtype: bool
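
The two reductions answer different questions: all() asks whether every value in a column satisfies the condition, any() whether at least one does. A small sketch on a hand-made frame (the column values are chosen for illustration) makes the difference visible:

```python
import pandas as pd

small = pd.DataFrame({
    "x": [0.0, 1.0, 2.0],     # contains a zero, so not all values are > 0
    "y": [1.0, 2.0, 3.0],     # strictly positive
    "z": [-1.0, -2.0, -3.0],  # strictly negative
})

positive = small > 0
all_positive = positive.all()  # per column: is every value > 0?
any_positive = positive.any()  # per column: is at least one value > 0?

print(all_positive)
print(any_positive)
```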

In [17]:
df3 = pd.DataFrame()

In [18]:
df3.empty

True

In [19]:
(df + df).equals(df * 2)

True
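
equals differs from an element-wise == followed by all() when missing values are involved, because NaN does not compare equal to itself while equals treats NaNs in the same position as equal. A short sketch, assuming a Series with one NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise comparison: NaN == NaN is False, so all() is False.
elementwise = (s + s == s * 2).all()

# equals() treats NaNs at the same position as equal.
same = (s + s).equals(s * 2)

print("== then all():", elementwise)
print("equals():     ", same)
```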

In [23]:
df['sin_loop'].mode()

0        -1.0
1        -1.0
2        -1.0
3        -1.0
4        -1.0
         ... 
999995    1.0
999996    1.0
999997    1.0
999998    1.0
999999    1.0
Name: sin_loop, Length: 1000000, dtype: float64
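
mode() returns every value that occurs most often, in sorted order. A result with 1,000,000 entries therefore means that every value in the column occurs exactly once, so all of them tie. A small sketch with illustrative data shows both cases:

```python
import pandas as pd

# Two values (3 and 7) tie for the highest count.
repeated = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
modes_repeated = repeated.mode()

# Every value occurs once, so all of them are returned, sorted.
unique = pd.Series([0.3, 0.1, 0.2])
modes_unique = unique.mode()

print(modes_repeated)
print(modes_unique)
```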