dtypes
====

The main types stored in pandas objects are `float`, `int`, `bool`, `datetime64[ns]` and `datetime64[ns, tz]` (in >= 0.17.0), `timedelta64[ns]`, `category` (in >= 0.15.0), and `object`. In addition, these dtypes have item sizes, e.g. `int64` and `int32`. See [Series with TZ](http://pandas.pydata.org/pandas-docs/version/0.20.3/timeseries.html#timeseries-timezone-series) for more detail on `datetime64[ns, tz]` dtypes.

The [`dtypes`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute of a DataFrame is a convenient way to get a Series with the data type of each column.

In [3]:
import numpy as np
import pandas as pd

dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3,dtype='int8')))

In [4]:
dft

          A  B    C          D    E      F  G
0  0.397938  1  foo 2001-01-02  1.0  False  1
1  0.697150  1  foo 2001-01-02  1.0  False  1
2  0.124704  1  foo 2001-01-02  1.0  False  1


In [6]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a `Series` use the [`dtype`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.Series.dtype.html#pandas.Series.dtype) attribute.

In [7]:
dft['A'].dtype

dtype('float64')

If a pandas object contains data with multiple dtypes *IN A SINGLE COLUMN*, the dtype of the column will be chosen to accommodate all of the data types (`object` is the most general).

In [8]:
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

In [10]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object

The method [`get_dtype_counts()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.get_dtype_counts.html#pandas.DataFrame.get_dtype_counts) will return the number of columns of each type in a `DataFrame`:

In [11]:
dft.get_dtype_counts()

float64           1
float32           1
int64             1
int8              1
datetime64[ns]    1
bool              1
object            1
dtype: int64
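`get_dtype_counts()` belongs to older pandas releases such as the 0.20.3 version documented here. As a hedged sketch, the same tally can be obtained by counting the values of the `dtypes` Series, which does not depend on that method. The frame `example` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with three distinct column dtypes.
example = pd.DataFrame({
    "x": [1.0, 2.0],      # float64
    "y": [1, 2],          # int64
    "z": [3.5, 4.5],      # float64
    "w": [True, False],   # bool
})

# Count how many columns share each dtype.
counts = example.dtypes.value_counts()
print(counts)
```

The result is an ordinary Series indexed by dtype, so it can be sorted or filtered like any other.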

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.11.0). If a dtype is passed (either directly via the `dtype` keyword, a passed `ndarray`, or a passed `Series`), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

In [13]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

df1

          A
0 -0.045061
1 -1.412086
2 -1.585406
3 -0.573533
4 -0.198591
5 -0.765236
6  1.072463
7 -0.139559


In [14]:
df1.dtypes

A    float32
dtype: object

In [16]:
df2 = pd.DataFrame(dict( A = pd.Series(np.random.randn(8), dtype='float16'),
                        B = pd.Series(np.random.randn(8)),
                        C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))

df2

          A         B    C
0 -0.461670 -0.911275    0
1 -0.086670  0.064364    0
2  0.404541 -1.081940    0
3 -0.133179  1.943328    0
4 -0.867188 -0.899134  254
5 -1.570312  1.180588    0
6 -0.200073 -0.769303    0
7 -0.476562  0.447481    0


In [17]:
df2.dtypes

A    float16
B    float64
C      uint8
dtype: object

**defaults**

By default, integer types are `int64` and float types are `float64`, *REGARDLESS* of platform (32-bit or 64-bit). The following will all result in `int64` dtypes.

In [18]:
pd.DataFrame([1, 2], columns=['a']).dtypes

a    int64
dtype: object

In [19]:
pd.DataFrame({'a': [1, 2]}).dtypes

a    int64
dtype: object

In [20]:
pd.DataFrame({'a': 1 }, index=list(range(2))).dtypes

a    int64
dtype: object

NumPy, however, will choose *platform-dependent* types when creating arrays. The following **WILL** result in `int32` on a 32-bit platform.

In [21]:
frame = pd.DataFrame(np.array([1, 2]))

frame

   0
0  1
1  2


In [22]:
frame.dtypes

0    int32
dtype: object
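To see which case applies on a given machine, the ndarray's dtype can be compared with the dtype of the frame built from it. This is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy picks its default integer type for the platform.
arr = np.array([1, 2])

# pandas keeps whatever dtype the ndarray already has...
from_array = pd.DataFrame(arr)

# ...but infers int64 when it builds the column from a Python list.
from_list = pd.DataFrame([1, 2])

print(arr.dtype, from_array.dtypes[0], from_list.dtypes[0])
```

If a fixed width matters, passing an explicit dtype when creating the array (for example `np.array([1, 2], dtype='int64')`) removes the platform dependence.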

**upcasting**

Types can potentially be *upcasted* when combined with other types, meaning they are promoted from the current type (e.g. `int` to `float`).

In [23]:
df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

df3

          A         B      C
0 -0.506731 -0.911275    0.0
1 -1.498756  0.064364    0.0
2 -1.180865 -1.081940    0.0
3 -0.706711  1.943328    0.0
4 -1.065779 -0.899134  254.0
5 -2.335548  1.180588    0.0
6  0.872390 -0.769303    0.0
7 -0.616122  0.447481    0.0


In [24]:
df3.dtypes

A    float32
B    float64
C    float64
dtype: object
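Because upcasting follows NumPy's promotion rules, `np.result_type` offers a quick way to preview the dtype that combining two dtypes would produce, such as the `float32`/`float16` pair in column `A` above. This is a small sketch using NumPy directly rather than anything pandas-specific.

```python
import numpy as np

# Result dtypes NumPy chooses when two dtypes are combined.
print(np.result_type(np.float32, np.float16))  # float32
print(np.result_type(np.float64, np.uint8))    # float64
print(np.result_type(np.int32, np.float32))    # float64
```

The last line shows that promotion can pick a type wider than either input when neither one can represent all values of the other.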

The `values` attribute on a DataFrame returns the *lowest common denominator* of the dtypes, meaning the dtype that can accommodate **ALL** of the types in the resulting homogeneously typed NumPy array. This can force some *upcasting*.

In [25]:
df3.values.dtype

dtype('float64')
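The effect is easiest to see on small frames with deliberately mixed columns. In this sketch (frame and column names are invented for illustration), a purely numeric frame collapses to a single float dtype, while adding a string column pushes the whole array to `object`.

```python
import pandas as pd

# int32 and float32 columns: the common dtype is a float wide enough for both.
mixed_numeric = pd.DataFrame({
    "i": pd.Series([1, 2], dtype="int32"),
    "f": pd.Series([0.5, 1.5], dtype="float32"),
})

# An integer column next to a string column: only object fits both.
with_strings = pd.DataFrame({"i": [1, 2], "s": ["a", "b"]})

print(mixed_numeric.values.dtype)  # float64
print(with_strings.values.dtype)   # object
```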

**astype**

You can use the [`astype()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype) method to explicitly convert dtypes from one to another. By default it returns a copy, even if the dtype is unchanged (pass `copy=False` to change this behavior). In addition, it raises an exception if the astype operation is invalid.

Upcasting always follows the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.

In [26]:
df3

          A         B      C
0 -0.506731 -0.911275    0.0
1 -1.498756  0.064364    0.0
2 -1.180865 -1.081940    0.0
3 -0.706711  1.943328    0.0
4 -1.065779 -0.899134  254.0
5 -2.335548  1.180588    0.0
6  0.872390 -0.769303    0.0
7 -0.616122  0.447481    0.0


In [27]:
df3.dtypes

A    float32
B    float64
C    float64
dtype: object

In [28]:
# conversion of dtypes
df3.astype('float32').dtypes

A    float32
B    float32
C    float32
dtype: object
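An invalid `astype` conversion raises rather than silently producing missing values. The following sketch, with made-up data, shows a string that cannot be parsed as an integer triggering a `ValueError`, next to a conversion that succeeds.

```python
import pandas as pd

s = pd.Series(["1", "2", "three"])

failed = False
try:
    s.astype("int64")          # "three" is not a valid integer literal
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

# With clean input the same call works.
converted = pd.Series(["1", "2", "3"]).astype("int64")
print(converted.dtype)
```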

Convert a subset of columns to a specified type using [`astype()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype):

In [31]:
dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

dft[['a','b']] = dft[['a','b']].astype(np.uint8)

In [32]:
dft

   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9


In [33]:
dft.dtypes

a    uint8
b    uint8
c    int64
dtype: object

New in version 0.19.0.

Convert certain columns to a specific dtype by passing a dict to [`astype()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype):

In [38]:
dft1 = pd.DataFrame({'a': [1,0,1], 'b': [4,5,6], 'c': [7, 8, 9]})

In [39]:
dft1 = dft1.astype({'a': bool, 'c': np.float64})  # builtin bool; the np.bool alias is gone from recent NumPy

In [40]:
dft1

       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0


In [41]:
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object

**Note**

When trying to convert a subset of columns to a specified type using [`astype()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.astype.html#pandas.DataFrame.astype) and [`loc()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc), upcasting occurs.

[`loc()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc) tries to fit what we are assigning into the current dtypes, while `[]` overwrites them, taking the dtype from the right-hand side. Therefore the following piece of code produces an unintended result.

In [42]:
dft = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7, 8, 9]})

dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes

a    uint8
b    uint8
dtype: object

In [43]:
dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [44]:
dft.dtypes

a    int64
b    int64
c    int64
dtype: object
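Two ways around this, both already described earlier in this section, are to assign through `[]` or to pass a column-to-dtype mapping to `astype()`. The sketch below applies each to a fresh copy of the same frame; `by_brackets` and `by_mapping` are illustrative names.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Option 1: assign through [] so the right-hand dtype replaces the old one.
by_brackets = dft.copy()
by_brackets[['a', 'b']] = by_brackets[['a', 'b']].astype(np.uint8)

# Option 2: pass a column-to-dtype mapping to astype (0.19.0 or later).
by_mapping = dft.astype({'a': np.uint8, 'b': np.uint8})

print(by_brackets.dtypes)
print(by_mapping.dtypes)
```

The mapping form returns a new frame and leaves the original untouched, which tends to be easier to reason about than in-place assignment.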

**object conversion**

pandas offers various functions to try to force conversion of types from the `object` dtype to other types. The following functions are available for one dimensional object arrays or scalars:

- **[`to_numeric()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.to_numeric.html#pandas.to_numeric) (conversion to numeric dtypes)**

In [45]:
m = ['1.1', 2, 3]

In [46]:
pd.to_numeric(m)

array([1.1, 2. , 3. ])

- **[`to_datetime()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.to_datetime.html#pandas.to_datetime) (conversion to datetime objects)**

In [49]:
import datetime

m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

m

['2016-07-09', datetime.datetime(2016, 3, 2, 0, 0)]

In [50]:
pd.to_datetime(m)

DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

- **[`to_timedelta()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.to_timedelta.html#pandas.to_timedelta) (conversion to timedelta objects)**
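The first two functions are demonstrated above; for completeness, here is a minimal sketch of `to_timedelta()` on a list of duration strings. The input values are invented for illustration.

```python
import pandas as pd

# Strings in several duration formats are parsed into one TimedeltaIndex.
durations = pd.to_timedelta(['1 days', '2 hours', '5us'])
print(durations)
```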

To force a conversion, we can pass in an `errors` argument, which specifies how pandas should deal with elements that cannot be converted to desired dtype or object. By default, `errors='raise'`, meaning that any errors encountered will be raised during the conversion process. However, if `errors='coerce'`, these errors will be ignored and pandas will convert problematic elements to `pd.NaT` (for datetime and timedelta) or `np.nan` (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [51]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2)]

pd.to_datetime(m, errors='coerce')

DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [52]:
m = ['apple', 2, 3]

In [53]:
pd.to_numeric(m, errors='coerce')

array([nan,  2.,  3.])

In [54]:
m = ['apple', pd.Timedelta('1day')]

In [55]:
pd.to_timedelta(m, errors='coerce')

TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
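A common follow-up to `errors='coerce'` is to check how many values failed to convert. This sketch uses a made-up Series of strings and counts the entries that became missing.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'n/a', '4'])

# Unparseable entries become NaN, so the result is a float column.
clean = pd.to_numeric(raw, errors='coerce')

# Number of values that could not be converted.
n_bad = int(clean.isna().sum())
print(clean.dtype, n_bad)
```

Note that the count also includes values that were already missing in the input, so it is an upper bound on the number of parse failures.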

The `errors` parameter has a third option of `errors='ignore'`, which will simply return the passed in data if it encounters any errors with the conversion to a desired data type:

In [56]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2)]

pd.to_datetime(m, errors='ignore')

array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)

In [57]:
m = ['apple', 2, 3]

In [58]:
pd.to_numeric(m, errors='ignore')

array(['apple', 2, 3], dtype=object)

In [59]:
m = ['apple', pd.Timedelta('1day')]

In [60]:
pd.to_timedelta(m, errors='ignore')

array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
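The examples above reflect pandas 0.20.3. To the best of my knowledge, recent pandas releases deprecate `errors='ignore'`, so it is worth checking the release notes for your version. A version-independent way to get the same "return the input on failure" behavior is an explicit `try`/`except`. The helper name `to_numeric_or_original` below is hypothetical, not a pandas function.

```python
import pandas as pd

def to_numeric_or_original(values):
    """Hypothetical helper: convert to numeric, or hand back the input unchanged."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

bad = ['apple', 2, 3]
good = ['1', 2, 3]

print(to_numeric_or_original(bad))   # the original list
print(to_numeric_or_original(good))  # a numeric array
```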

In addition to object conversion, [`to_numeric()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.to_numeric.html#pandas.to_numeric) provides another argument `downcast`, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [62]:
m = ['1', 2, 3]

pd.to_numeric(m, downcast='integer')   # smallest signed int dtype

array([1, 2, 3], dtype=int8)

In [63]:
pd.to_numeric(m, downcast='signed')    # same as 'integer'

array([1, 2, 3], dtype=int8)

In [64]:
pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype

array([1, 2, 3], dtype=uint8)

In [65]:
pd.to_numeric(m, downcast='float')     # smallest float dtype

array([1., 2., 3.], dtype=float32)
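To make the memory saving concrete, the sketch below downcasts a Series of 1000 small integers and compares the bytes used before and after. The data is invented, and the dtype chosen depends on the range of the values.

```python
import numpy as np
import pandas as pd

# 1000 values in the range 0..999, stored as 8-byte integers.
s = pd.Series(np.arange(1000), dtype='int64')

# 999 does not fit in int8, so the smallest signed type that works is int16.
small = pd.to_numeric(s, downcast='integer')

print(s.dtype, s.nbytes)          # int64 8000
print(small.dtype, small.nbytes)  # int16 2000
```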

These methods apply only to one-dimensional arrays, lists or scalars, so they cannot be used directly on multi-dimensional objects such as DataFrames. However, with [`apply()`](http://pandas.pydata.org/pandas-docs/version/0.20.3/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply), we can “apply” the function over each column efficiently:

In [66]:
import datetime

df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

df

            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00


In [67]:
df.apply(pd.to_datetime)

           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02


In [69]:
df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

df

     0  1  2
0  1.1  2  3
1  1.1  2  3


In [70]:
df.apply(pd.to_numeric)

     0  1  2
0  1.1  2  3
1  1.1  2  3


In [71]:
df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [72]:
df

     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00


In [73]:
df.apply(pd.to_timedelta)

                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
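Keyword arguments given to `apply()` are forwarded to the applied function, so options such as `errors='coerce'` can be combined with the column-wise approach. A minimal sketch with invented data:

```python
import pandas as pd

# Object-dtype frame where the middle column cannot be parsed as a number.
df = pd.DataFrame([['1.1', 'apple', 3]] * 2, dtype='O')

# errors='coerce' is passed through apply() to pd.to_numeric for each column.
converted = df.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
```

Each column is converted independently, so one unparseable column does not stop the others from getting a proper numeric dtype.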


**gotchas**

Performing selection operations on `integer` type data can easily upcast the data to `floating`. The dtype of the input data will be preserved in cases where `nans` are not introduced (starting in 0.11.0). See also [Support for integer NA](http://pandas.pydata.org/pandas-docs/version/0.20.3/gotchas.html#gotchas-intna).

In [76]:
dfi = df3.astype('int32')

dfi['E'] = 1

dfi

   A  B    C  E
0  0  0    0  1
1 -1  0    0  1
2 -1 -1    0  1
3  0  1    0  1
4 -1  0  254  1
5 -2  1    0  1
6  0  0    0  1
7  0  0    0  1


In [77]:
dfi.dtypes

A    int32
B    int32
C    int32
E    int64
dtype: object

In [80]:
casted = dfi[dfi>0]

casted

    A    B      C  E
0 NaN  NaN    NaN  1
1 NaN  NaN    NaN  1
2 NaN  NaN    NaN  1
3 NaN  1.0    NaN  1
4 NaN  NaN  254.0  1
5 NaN  1.0    NaN  1
6 NaN  NaN    NaN  1
7 NaN  NaN    NaN  1


In [83]:
casted.dtypes

A    float64
B    float64
C    float64
E      int64
dtype: object
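The same rule can be seen on a single Series, which makes it easier to isolate: masking upcasts an integer column only when it actually has to insert a missing value. This is a small sketch with illustrative names.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')

kept = s[s > 1]            # filtering drops rows, no NaN needed
masked = s.where(s > 1)    # the first value is replaced by NaN
unmasked = s.where(s > 0)  # condition is true everywhere, nothing replaced

print(kept.dtype, masked.dtype, unmasked.dtype)
```

Column `E` in the DataFrame example above behaves like `unmasked`: every value passes the condition, so it keeps its integer dtype.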

Float dtypes, by contrast, are unchanged.

In [84]:
dfa = df3.copy()

In [85]:
dfa['A'] = dfa['A'].astype('float32')

In [86]:
dfa.dtypes

A    float32
B    float64
C    float64
dtype: object

In [87]:
casted = dfa[df2>0]

casted

          A         B      C
0       NaN       NaN    NaN
1       NaN  0.064364    NaN
2 -1.180865       NaN    NaN
3       NaN  1.943328    NaN
4       NaN       NaN  254.0
5       NaN  1.180588    NaN
6       NaN       NaN    NaN
7       NaN  0.447481    NaN


In [88]:
casted.dtypes

A    float32
B    float64
C    float64
dtype: object