# Pandas dtypes

In this activity, we will learn about the data types that Pandas uses in Series and DataFrames. Knowing them is important for understanding what we can do with our data in Pandas.

You can continue in the notebook from the previous activity. If you decide to create a new one, don't forget to import the packages.

For the most part, Pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].

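However, NumPy has no dedicated types for some kinds of data, such as categoricals, nullable integers, or strings with missing values. Pandas therefore extends NumPy's type system in a few places. The following table lists the most common Pandas extension types, along with the general-purpose object dtype: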

| Kind of Data | Data Type | String Aliases |
| :---------- | :----------: | ----------: |
| Categorical | CategoricalDtype | 'category' |
| Nullable integer | Int64Dtype | 'Int8', 'UInt8', 'Int16', 'UInt16', ... |
| Strings | StringDtype | 'string' |
| Boolean (with NA) | BooleanDtype | 'boolean' |
| Any | object dtype | 'object' |
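
As a minimal sketch of how the string aliases in the table can be used, the example below builds one Series per extension type by passing the alias as `dtype`. The variable names are illustrative only and do not come from the activity.

```python
import pandas as pd

# Each Series is created with one of the string aliases from the table.
cat = pd.Series(['a', 'b', 'a'], dtype='category')
nullable_int = pd.Series([1, None, 3], dtype='Int64')
text = pd.Series(['foo', None], dtype='string')
flags = pd.Series([True, None, False], dtype='boolean')

# Print the dtype that Pandas assigned to each Series.
for s in (cat, nullable_int, text, flags):
    print(s.dtype)

# Unlike plain NumPy integers, the nullable integer Series can hold a missing value.
print(nullable_int.isna().sum())
```

Note the capital letter in 'Int64': the lowercase alias 'int64' refers to the plain NumPy integer type, which cannot store missing values.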

The dtypes attribute of a DataFrame returns a Series with the data type of each column.

In [1]:
import numpy as np
import pandas as pd

In [2]:
dft = pd.DataFrame({'A': np.random.rand(3),
                    'B': 1,
                    'C': 'foo',
                    'D': pd.Timestamp('20010102'),
                    'E': pd.Series([1.0] * 3).astype('float32'),
                    'F': False,
                    'G': pd.Series([1] * 3, dtype='int8')})

In [3]:
dft

          A  B    C          D    E      F  G
0  0.49946   1  foo 2001-01-02  1.0  False  1
1  0.159119  1  foo 2001-01-02  1.0  False  1
2  0.317692  1  foo 2001-01-02  1.0  False  1


In [5]:
dft.dtypes

A          float64
B            int64
C           object
D    datetime64[s]
E          float32
F             bool
G             int8
dtype: object

A Series has a corresponding dtype attribute:

In [6]:
dft['A'].dtype

dtype('float64')

Pandas has two ways of storing strings:

1. object dtype, which can hold any Python object, including strings.
2. StringDtype, which is dedicated to strings (available since Pandas 1.0.0, released in 2020).

It is recommended to use StringDtype for strings, because an object column can hold values of any type, so mixed data can go unnoticed:

In [7]:
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3    6.0
4    foo
dtype: object
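
To make the difference between the two string storage options more concrete, here is a small sketch that inspects what an object Series actually contains and compares it with a Series created with the 'string' alias. The names `mixed`, `as_text`, and `element_types` are illustrative.

```python
import pandas as pd

# Same data as above: Pandas falls back to object dtype for mixed values.
mixed = pd.Series([1, 2, 3, 6., 'foo'])

# A Series that is explicitly declared as text, with one missing value.
as_text = pd.Series(['foo', 'bar', None], dtype='string')

# Each element of the object Series keeps its own Python type.
element_types = [type(v).__name__ for v in mixed]
print(mixed.dtype)
print(element_types)

# The string Series reports a dedicated dtype and tracks the missing value.
print(as_text.dtype)
print(as_text.isna().tolist())
```

The object Series reports only one dtype for the whole column, so the mix of integers, floats, and strings is visible only when the elements are inspected one by one.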

## Converting

You can use the astype() method to explicitly convert from one dtype to another. By default, it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior). It raises an exception if the requested conversion is invalid.

In [8]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [9]:
df1.dtypes

A    float32
dtype: object

In [10]:
df1 = df1.astype('float64')

In [11]:
df1.dtypes

A    float64
dtype: object
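
The following sketch illustrates the failure case mentioned above: a conversion that cannot succeed raises an exception instead of silently producing wrong values. The sample data and variable names are assumptions made for the illustration, and the Series is created with an explicit object dtype so that the behavior does not depend on how Pandas infers string columns.

```python
import pandas as pd

# Text values stored as Python objects; the last one is not a valid integer.
raw = pd.Series(['1', '2', 'three'], dtype=object)

try:
    raw.astype('int64')
    failed = False
except ValueError as exc:
    # astype() refuses the whole conversion because of the invalid element.
    failed = True
    print(f'astype failed: {exc}')

# A Series that contains only valid integer text converts without problems.
clean = pd.Series(['1', '2', '3'], dtype=object).astype('int64')
print(clean.dtype)
```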

You can also call astype() on a subset of columns, or on a single column (that is, a Series).
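
As a rough sketch of converting only part of a DataFrame, the example below converts one column, then two columns selected with a list, and assigns the results back. The DataFrame `df` and the chosen target dtypes are illustrative and not part of the activity.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Convert a single column (a Series) and assign the result back.
df['a'] = df['a'].astype('float32')

# Convert several columns at once by selecting them with a list of names.
df[['b', 'c']] = df[['b', 'c']].astype('int8')

print(df.dtypes)
```

Because astype() returns a new object, the assignment back to `df` is what makes the change visible in the DataFrame.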

Convert certain columns to a specific dtype by passing a dict to astype().

In [12]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [14]:
dft1 = dft1.astype({'a': bool, 'c': np.float64})

In [15]:
dft1

       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0


In [17]:
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object