### Filling while reindexing
reindex() takes an optional parameter method which is a filling method chosen from the following table:

1. pad / ffill: fill values forward

2. bfill / backfill: fill values backward

3. nearest: fill from the nearest index value

In [1]:
import pandas as pd
import numpy as np

In [17]:
rng = pd.date_range('1/3/2000', periods=8)
ts = pd.Series(np.random.randn(8), index=rng)
ts2 = ts.iloc[[0, 3, 6]]  # select by position explicitly

In [19]:
ts

2000-01-03    0.979906
2000-01-04   -0.829857
2000-01-05    0.113339
2000-01-06   -0.639766
2000-01-07    0.417060
2000-01-08    0.785277
2000-01-09    1.097579
2000-01-10   -0.372327
Freq: D, dtype: float64

In [20]:
ts2

2000-01-03    0.979906
2000-01-06   -0.639766
2000-01-09    1.097579
dtype: float64

In [21]:
ts2.reindex(ts.index)

2000-01-03    0.979906
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -0.639766
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    1.097579
2000-01-10         NaN
Freq: D, dtype: float64

In [22]:
ts2.reindex(ts.index, method='ffill')

2000-01-03    0.979906
2000-01-04    0.979906
2000-01-05    0.979906
2000-01-06   -0.639766
2000-01-07   -0.639766
2000-01-08   -0.639766
2000-01-09    1.097579
2000-01-10    1.097579
Freq: D, dtype: float64

In [23]:
ts2.reindex(ts.index, method='bfill')

2000-01-03    0.979906
2000-01-04   -0.639766
2000-01-05   -0.639766
2000-01-06   -0.639766
2000-01-07    1.097579
2000-01-08    1.097579
2000-01-09    1.097579
2000-01-10         NaN
Freq: D, dtype: float64

In [24]:
ts2.reindex(ts.index, method='nearest')

2000-01-03    0.979906
2000-01-04    0.979906
2000-01-05   -0.639766
2000-01-06   -0.639766
2000-01-07   -0.639766
2000-01-08    1.097579
2000-01-09    1.097579
2000-01-10    1.097579
Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.

Note that the same result could have been achieved using ffill() (written as fillna(method='ffill') in older pandas releases; there is no fillna equivalent of method='nearest') or interpolate:

In [26]:
ts2.reindex(ts.index).ffill()

2000-01-03    0.979906
2000-01-04    0.979906
2000-01-05    0.979906
2000-01-06   -0.639766
2000-01-07   -0.639766
2000-01-08   -0.639766
2000-01-09    1.097579
2000-01-10    1.097579
Freq: D, dtype: float64

When a fill method is given, reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() will not perform any checks on the order of the index.
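The difference in ordering checks can be seen in a small sketch. The Series `unordered` and its labels are made up for illustration, and the error is caught only to show which call raises.

```python
import pandas as pd

# Hypothetical Series whose index is neither increasing nor decreasing
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

try:
    # A fill method needs a monotonic index, so this call raises
    unordered.reindex([1, 2, 3, 4], method='ffill')
    raised = False
except ValueError:
    raised = True

# Reindexing first and filling afterwards performs no ordering check;
# the fill simply follows the order of the new index
filled = unordered.reindex([1, 2, 3, 4]).ffill()

print(raised)
print(filled)
```

Because the second route follows the order of the new index rather than the original one, the two approaches are only interchangeable when the original index is sorted.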



### Limits on filling while reindexing
The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the maximum count of consecutive matches:

In [30]:
ts2.reindex(ts.index, method='ffill', limit=1)

2000-01-03    0.979906
2000-01-04    0.979906
2000-01-05         NaN
2000-01-06   -0.639766
2000-01-07   -0.639766
2000-01-08         NaN
2000-01-09    1.097579
2000-01-10    1.097579
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:



In [28]:
ts2.reindex(ts.index, method='ffill', tolerance='1 day')

2000-01-03    0.979906
2000-01-04    0.979906
2000-01-05         NaN
2000-01-06   -0.639766
2000-01-07   -0.639766
2000-01-08         NaN
2000-01-09    1.097579
2000-01-10    1.097579
Freq: D, dtype: float64
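With evenly spaced daily data the two arguments give the same output, so the following sketch uses a made-up integer index with uneven gaps to separate them. The names `obs` and `target` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: two observations, five target labels with uneven gaps
obs = pd.Series([10.0, 20.0], index=[0, 5])
target = [0, 3, 5, 6, 7]

# limit=1: at most one consecutive fill after each observation,
# no matter how far away the target label is
by_limit = obs.reindex(target, method='ffill', limit=1)

# tolerance=1: fill only when the target label lies within
# a distance of 1 from the matched original label
by_tolerance = obs.reindex(target, method='ffill', tolerance=1)

print(pd.DataFrame({'limit=1': by_limit, 'tolerance=1': by_tolerance}))
```

Label 3 is filled under limit (it is the first fill after label 0) but left missing under tolerance (it is 3 units away from label 0). Label 7 is missing in both cases.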

### Dropping labels from an axis
A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [33]:
df = pd.DataFrame(np.random.randn(5,3), index=list('abcde'), columns=list('ABC'))
df

          A         B         C
a  0.270335 -1.335537  0.119973
b  1.080195 -1.177311  2.549449
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


In [35]:
df.drop(list('ab'), axis=0)

          A         B         C
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


In [41]:
df.drop(list('AB'), axis=1)

          C
a  0.119973
b  2.549449
c  0.098326
d  1.560040
e  0.146171


Note that the following also works, but is a bit less obvious / clean:



In [42]:
df.reindex(df.index.difference(list('ab')))

          A         B         C
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


### Renaming / mapping labels
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [43]:
s = pd.Series(np.arange(5), index=list('abcde'))
s

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [44]:
s.rename(str.upper)

A    0
B    1
C    2
D    3
E    4
dtype: int64

In [47]:
s.rename({'a':'Neil', 'b':'Shivani'})

Neil       0
Shivani    1
c          2
d          3
e          4
dtype: int64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). A dict or Series can also be used:

In [48]:
df

          A         B         C
a  0.270335 -1.335537  0.119973
b  1.080195 -1.177311  2.549449
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


In [49]:
df.rename(columns={'A':'Neil', 'B':'Shivani'}, index={'a':'1', 'b':2})

       Neil   Shivani         C
1  0.270335 -1.335537  0.119973
2  1.080195 -1.177311  2.549449
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping don’t throw an error.
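A short sketch of this behavior, using a made-up frame and mapping. The `errors='raise'` option shown at the end exists in recent pandas versions; whether it is available in the release used for the rest of this notebook is an assumption.

```python
import pandas as pd

# Hypothetical frame used only for this illustration
frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# 'Z' is not a column of frame; by default it is silently ignored
renamed = frame.rename(columns={'A': 'alpha', 'Z': 'zeta'})
print(list(renamed.columns))

# errors='raise' turns an unknown label into a KeyError
try:
    frame.rename(columns={'Z': 'zeta'}, errors='raise')
    strict_failed = False
except KeyError:
    strict_failed = True
print(strict_failed)
```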

DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [50]:
df.rename({'A':'Anisha'}, axis='columns')

     Anisha         B         C
a  0.270335 -1.335537  0.119973
b  1.080195 -1.177311  2.549449
c  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


In [51]:
df.rename({'a':1,'b':2, 'c':3}, axis='index')

          A         B         C
1  0.270335 -1.335537  0.119973
2  1.080195 -1.177311  2.549449
3  0.404073  0.719405  0.098326
d -0.513578  0.047042  1.560040
e  0.038612  0.404348  0.146171


The rename() method also provides an inplace named parameter that is by default False and copies the underlying data. Pass inplace=True to rename the data in place.

Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [53]:
s.rename('Neil_series', inplace=True)  # returns None when inplace=True
s

a    0
b    1
c    2
d    3
e    4
Name: Neil_series, dtype: int64

In [55]:
s.name

'Neil_series'

The methods DataFrame.rename_axis() and Series.rename_axis() allow specific names of a MultiIndex to be changed (as opposed to the labels).

In [56]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [10, 20, 30, 40, 50, 60]},
                  index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
                                                   names=['let', 'num']))

In [57]:
df

         x   y
let num
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60


In [58]:
df.rename_axis(index={'let':'Neil'})

          x   y
Neil num
a    1    1  10
     2    2  20
b    1    3  30
     2    4  40
c    1    5  50
     2    6  60


In [59]:
df.rename_axis(index=str.upper)

         x   y
LET NUM
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60


### Iteration
The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the “keys” of the objects.

In short, basic iteration (for i in object) produces:

1. Series: values

2. DataFrame: column labels

Thus, for example, iterating over a DataFrame gives you the column names:

In [60]:
df = pd.DataFrame({'col1': np.random.randn(3), 'col2': np.random.randn(3)}, index=['a', 'b', 'c'])

In [61]:
df

       col1      col2
a  1.628500 -1.248908
b  0.191749  1.034255
c  1.220937 -1.997662


In [63]:
for i in df:
    print(i)

col1
col2


Pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.

To iterate over the rows of a DataFrame, you can use the following methods:

1. iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.

2. itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows(), and is in most cases preferable to use to iterate over the values of a DataFrame.
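The dtype change mentioned for iterrows() can be sketched as follows. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
frame = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.5, 2.5]})

# iterrows() packs each row into a single Series, so the integer
# is upcast to float to share one dtype with the float column
_, first_row = next(frame.iterrows())
print(first_row.dtype)

# itertuples() keeps each value in its own field of the namedtuple
first_tuple = next(frame.itertuples())
print(first_tuple.int_col, first_tuple.float_col)
```

The row Series produced by iterrows() has a single floating-point dtype, while the integer value survives unchanged in the namedtuple.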

#### Warning:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

1. Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …

2. When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.

3. If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.
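As a rough sketch of the first two alternatives, the snippet below computes the same row-wise sum three ways. The frame and the operation are assumptions chosen only to keep the comparison small.

```python
import numpy as np
import pandas as pd

# Hypothetical data for the comparison
frame = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5) * 2.0})

# Row-by-row loop: works, but builds a Series for every row
looped = [row['x'] + row['y'] for _, row in frame.iterrows()]

# Vectorized: one operation over whole columns
vectorized = frame['x'] + frame['y']

# apply() for a function that needs one row at a time
applied = frame.apply(lambda row: row['x'] + row['y'], axis=1)

print(vectorized.tolist())
```

All three produce the same values; the vectorized form is usually the one to prefer when it is available.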


You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

For example, in the following case setting the value has no effect:

In [64]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df

   a  b
0  1  a
1  2  b
2  3  c


In [65]:
for index, row in df.iterrows():
    row['a'] = 10
df

   a  b
0  1  a
1  2  b
2  3  c
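To actually change the values, assign through the DataFrame itself rather than through the row yielded by the iterator. A minimal sketch, with the replacement values 10 and 99 chosen arbitrarily:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Writing through the DataFrame changes the stored data
frame['a'] = 10                           # whole column at once
frame.loc[frame['b'] == 'c', 'a'] = 99    # only the selected rows
print(frame)
```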


### .dt accessor
Series has an accessor, .dt, that succinctly returns datetime-like properties for the values of the Series, provided it is a datetime- or period-like Series. The result is a Series indexed like the existing Series.

In [67]:
s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))

In [68]:
s

0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [69]:
s.dt.date

0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
dtype: object

In [70]:
s.dt.hour

0    9
1    9
2    9
3    9
dtype: int64

In [71]:
s.dt.second

0    12
1    12
2    12
3    12
dtype: int64

This enables nice expressions like this:



In [72]:
s[s.dt.day==2]

1   2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz-aware transformations:

In [74]:
s2 = s.dt.tz_localize('Europe/Warsaw')

In [75]:
s2

0   2013-01-01 09:10:12+01:00
1   2013-01-02 09:10:12+01:00
2   2013-01-03 09:10:12+01:00
3   2013-01-04 09:10:12+01:00
dtype: datetime64[ns, Europe/Warsaw]

In [76]:
s2.dt.tz

<DstTzInfo 'Europe/Warsaw' LMT+1:24:00 STD>

You can also chain these types of operations:

In [77]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

You can also format datetime values as strings with Series.dt.strftime() which supports the same format as the standard strftime().

In [78]:
s = pd.Series(pd.date_range('20130101', periods=4))

In [79]:
s.dt.strftime('%Y/%m/%d')

0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

In [80]:
s = pd.Series(pd.period_range('20130101', periods=4))

In [81]:
s

0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
dtype: period[D]

In [82]:
s.dt.strftime('%Y/%m/%d')

0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

The .dt accessor works for period and timedelta dtypes.

In [84]:
s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [85]:
s

0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
dtype: period[D]

In [86]:
s.dt.day

0    1
1    2
2    3
3    4
dtype: int64

In [87]:
s.dt.hour

0    0
1    0
2    0
3    0
dtype: int64

In [88]:
s.dt.year

0    2013
1    2013
2    2013
3    2013
dtype: int64

In [89]:
s = pd.Series(pd.timedelta_range('1 day 00:00:05', periods=4, freq='s'))

In [90]:
s

0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
dtype: timedelta64[ns]

In [91]:
s.dt.days

0    1
1    1
2    1
3    1
dtype: int64

In [92]:
s.dt.seconds

0    5
1    6
2    7
3    8
dtype: int64

In [93]:
s.dt.components

   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      0        0        5             0             0            0
1     1      0        0        6             0             0            0
2     1      0        0        7             0             0            0
3     1      0        0        8             0             0            0


### Note:
Accessing Series.dt on non-datetime-like values raises an error (an AttributeError in current pandas versions).
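One way to avoid the error is to check the dtype before touching .dt. The helper `year_or_none` below is hypothetical and only illustrates the guard. It relies on `pandas.api.types.is_datetime64_any_dtype`, which covers datetime dtypes but not period or timedelta ones.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

numbers = pd.Series([1, 2, 3])
stamps = pd.Series(pd.date_range('2013-01-01', periods=3))

def year_or_none(series):
    # Hypothetical helper: only use .dt when the dtype is datetime-like
    if is_datetime64_any_dtype(series):
        return series.dt.year
    return None

print(year_or_none(numbers))
print(year_or_none(stamps))
```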



### Vectorized string methods
Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [94]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],dtype="string")

In [95]:
s

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [97]:
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [98]:
s.str.len()

0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [99]:
s.str.islower()

0    False
1    False
2    False
3    False
4    False
5     <NA>
6    False
7     True
8     True
dtype: boolean

Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses regular expressions by default (and in some cases always uses them).
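The regular-expression default matters as soon as the pattern contains a special character. A small sketch with made-up values, using `str.contains` and its `regex` argument:

```python
import pandas as pd

# Hypothetical values; only the first contains a literal dot
names = pd.Series(['a.b', 'abc'])

# Default: the pattern is a regular expression, and '.' matches any character
as_regex = names.str.contains('.')

# regex=False treats the pattern as a plain substring
as_literal = names.str.contains('.', regex=False)

print(as_regex.tolist(), as_literal.tolist())
```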

### Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

### By index
The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.

In [100]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [101]:
unsorted_df = df.reindex(index=['b','d','c','a'], columns=['three', 'two','one'])
unsorted_df

      three       two       one
b -1.211166 -0.625813  1.485250
d -0.229166 -0.394165       NaN
c -0.283297 -0.136146 -0.962099
a       NaN -1.373752 -0.523786


In [102]:
unsorted_df.sort_index()

      three       two       one
a       NaN -1.373752 -0.523786
b -1.211166 -0.625813  1.485250
c -0.283297 -0.136146 -0.962099
d -0.229166 -0.394165       NaN


In [103]:
unsorted_df.sort_index(ascending=False)

      three       two       one
d -0.229166 -0.394165       NaN
c -0.283297 -0.136146 -0.962099
b -1.211166 -0.625813  1.485250
a       NaN -1.373752 -0.523786


In [104]:
unsorted_df.sort_index(axis=1)

        one     three       two
b  1.485250 -1.211166 -0.625813
d       NaN -0.229166 -0.394165
c -0.962099 -0.283297 -0.136146
a -0.523786       NaN -1.373752


In [105]:
unsorted_df['three'].sort_index()

a         NaN
b   -1.211166
c   -0.283297
d   -0.229166
Name: three, dtype: float64

Sorting by index also supports a key parameter that takes a callable function to apply to the index being sorted. For MultiIndex objects, the key is applied per-level to the levels specified by level.
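A minimal sketch of the key parameter on sort_index(), assuming pandas 1.1 or later (older releases do not accept key). The Series and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical Series with mixed-case labels
mixed = pd.Series([1, 2, 3], index=['b', 'C', 'a'])

# Plain sort: uppercase letters order before lowercase ones
default_order = mixed.sort_index()

# key receives the Index and returns the values to compare instead
folded_order = mixed.sort_index(key=lambda idx: idx.str.lower())

print(list(default_order.index), list(folded_order.index))
```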

### By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.

In [106]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                    'two': [1, 3, 2, 4], 'three': [5, 4, 3, 2]})
df1

   one  two  three
0    2    1      5
1    1    3      4
2    1    2      3
3    1    4      2


In [107]:
df1.sort_values(by='two')

   one  two  three
0    2    1      5
2    1    2      3
1    1    3      4
3    1    4      2


The by parameter can take a list of column names, e.g.:

In [109]:
df1.sort_values(by=['one','two'])

   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5
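When sorting by several columns, the ascending argument can also take a list with one flag per column. The sketch below rebuilds the same df1 and sorts 'one' ascending with ties broken by 'two' descending.

```python
import pandas as pd

df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                    'two': [1, 3, 2, 4],
                    'three': [5, 4, 3, 2]})

# 'one' ascending, ties broken by 'two' descending
mixed = df1.sort_values(by=['one', 'two'], ascending=[True, False])
print(mixed)
```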


These methods have special treatment of NA values via the na_position argument:

In [110]:
s

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [111]:
s[2] = np.nan

In [112]:
s.sort_values()

0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2    <NA>
5    <NA>
dtype: string

In [113]:
s.sort_values(na_position='first')

2    <NA>
5    <NA>
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: string

Sorting also supports a key parameter that takes a callable function to apply to the values being sorted.



In [119]:
s3 = pd.Series(['B','C','a'])

In [120]:
s3

0    B
1    C
2    a
dtype: object

In [121]:
s3.sort_values()

0    B
1    C
2    a
dtype: object

In [125]:
s3.sort_values(key=lambda x: x.str.lower())  # requires pandas 1.1 or later

2    a
0    B
1    C
dtype: object

key will be given the Series of values and should return a Series or array of the same shape with the transformed values. For DataFrame objects, the key is applied per column, so the key should still expect a Series and return a Series, e.g.

In [126]:
df = pd.DataFrame({"a": ['B', 'a', 'C'], "b": [1, 2, 3]})

In [127]:
df.sort_values(by='a')

   a  b
0  B  1
2  C  3
1  a  2


In [128]:
df.sort_values(by='a', key=lambda col: col.str.lower())  # requires pandas 1.1 or later

   a  b
1  a  2
0  B  1
2  C  3

The name or type of each column can be used to apply different functions to different columns.
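A sketch of a key function that branches on the column name, assuming pandas 1.1 or later. The frame and the helper `per_column_key` are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame({'a': ['b', 'B', 'a'], 'b': [1, 2, 3]})

def per_column_key(col):
    # Hypothetical helper: key is called once for each column named in `by`
    if col.name == 'a':
        return col.str.lower()   # compare text case-insensitively
    return -col                  # reverse the numeric order

ordered = frame.sort_values(by=['a', 'b'], key=per_column_key)
print(ordered)
```

Rows are ordered by the lowercased 'a' first. The two rows that tie on 'b'/'B' are then ordered by descending 'b'.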



### smallest / largest values
Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.

In [130]:
s = pd.Series(np.random.permutation(10))

In [131]:
s

0    2
1    6
2    4
3    9
4    7
5    3
6    8
7    0
8    1
9    5
dtype: int64

In [132]:
s.sort_values()

7    0
8    1
0    2
5    3
2    4
9    5
1    6
4    7
6    8
3    9
dtype: int64

In [133]:
s.nsmallest(3)

7    0
8    1
0    2
dtype: int64

In [134]:
s.nlargest(3)

3    9
6    8
4    7
dtype: int64
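The equivalence with sorting and taking head(n) can be checked directly. This is a sketch on a made-up Series of distinct values. It shows that the results agree, not how much faster nsmallest() is.

```python
import numpy as np
import pandas as pd

# Hypothetical Series of distinct values, shuffled with a fixed seed
values = pd.Series(np.random.default_rng(0).permutation(1000))

via_sort = values.sort_values().head(3)
via_nsmallest = values.nsmallest(3)

# Both routes select the same labels and values
print(via_nsmallest.equals(via_sort))
```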

DataFrame also has the nlargest and nsmallest methods.

In [136]:
df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
                   'b': list('abdceff'),
                   'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
 

In [137]:
df.nlargest(3,'a')

    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN


In [139]:
df.nlargest(3,['a','c'])

    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN


In [140]:
df.nsmallest(3, 'a')

   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0


In [142]:
df.nsmallest(3, ['a','c'])

   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0


### Object conversion
The following functions are available for one-dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

1. to_numeric() (conversion to numeric dtypes)

In [143]:
m = ['1.1', 2, 3]

pd.to_numeric(m)

array([1.1, 2. , 3. ])

2. to_datetime() (conversion to datetime objects)

In [145]:
import datetime

m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

pd.to_datetime(m)

DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

3. to_timedelta() (conversion to timedelta objects)

In [146]:
m = ['5us', pd.Timedelta('1day')]

In [147]:
pd.to_timedelta(m)

TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [148]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2)]

pd.to_datetime(m, errors='coerce')

DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [149]:
m = ['apple', 2, 3]

pd.to_numeric(m, errors='coerce')

array([nan,  2.,  3.])

In [150]:
m = ['apple', pd.Timedelta('1day')]

pd.to_timedelta(m, errors='coerce')

TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
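Once problematic entries have been coerced, they behave like any other missing value and can be located in the original data. A small sketch with a made-up column:

```python
import pandas as pd

# Hypothetical raw column with one entry that is not a number
raw = pd.Series(['1.5', 'n/a', '3'])

numbers = pd.to_numeric(raw, errors='coerce')

# The entry that could not be parsed is now NaN, so isna() can be
# used to find the offending raw values
bad_rows = raw[numbers.isna()]

print(numbers.tolist())
print(bad_rows.tolist())
```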

The errors parameter has a third option, errors='ignore', which simply returns the passed-in data if it encounters any errors with the conversion to a desired data type (this option is deprecated as of pandas 2.2 in favor of catching the exception explicitly):

In [151]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2)]

pd.to_datetime(m, errors='ignore')

Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [152]:
m = ['apple', 2, 3]

pd.to_numeric(m, errors='ignore')

array(['apple', 2, 3], dtype=object)

In [154]:
m = ['apple', pd.Timedelta('1day')]
pd.to_timedelta(m, errors='ignore')

array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:



In [155]:
m = ['1', 2, 3]

pd.to_numeric(m, downcast='integer')   # smallest signed int dtype

array([1, 2, 3], dtype=int8)

In [156]:
pd.to_numeric(m, downcast='signed')    # same as 'integer'

array([1, 2, 3], dtype=int8)

In [157]:
pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype

array([1, 2, 3], dtype=uint8)

In [158]:
pd.to_numeric(m, downcast='float')     # smallest float dtype

array([1., 2., 3.], dtype=float32)
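The dtype chosen by downcast depends on the values, since it must still be able to hold all of them. A brief sketch with made-up inputs:

```python
import pandas as pd

small = pd.to_numeric(pd.Series([1, 2, 3]), downcast='integer')
wider = pd.to_numeric(pd.Series([1, 2, 300]), downcast='integer')

# 300 does not fit in int8, so the second result needs a wider dtype
print(small.dtype, wider.dtype)
```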

As these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

In [159]:
import datetime

df = pd.DataFrame([
    ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

In [160]:
df

            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00


In [161]:
df.apply(pd.to_datetime)

           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02


In [162]:
df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')
df

     0  1  2
0  1.1  2  3
1  1.1  2  3


In [163]:
df.apply(pd.to_numeric)

     0  1  2
0  1.1  2  3
1  1.1  2  3


In [164]:
df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')
df

     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00


In [165]:
df.apply(pd.to_timedelta)

                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
