# Data Types and Missing Values

One of the most important pieces of information you can have about your DataFrame is the data type of each column. pandas stores its data such that each column has exactly one data type. Many data types are available for DataFrame columns. This chapter covers only the most common ones and briefly summarizes each. For extensive coverage of every data type, see part **05. Data Types**.

## Common data types

The following data types appear most frequently in DataFrames.

* **boolean** - Only two possible values, `True` and `False`
* **integer** - Whole numbers without decimals
* **float** - Numbers with decimals
* **object** - Almost always strings, but can technically contain any Python object 
* **datetime** - Specific date and time with nanosecond precision
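As a rough illustration of the list above, the following sketch builds a small DataFrame with one column of each common data type. The column names and values are made up for this example and are not taken from any dataset in this book. The `dtypes` attribute used at the end is covered later in this chapter.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "is_member": [True, False, True],                        # boolean
    "trips": [3, 12, 7],                                     # integer
    "distance": [1.5, 20.25, 8.0],                           # float
    "name": pd.Series(["Ana", "Ben", "Cy"], dtype=object),   # object holding strings
    "start": pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-02 09:30", "2020-01-03 17:45"]
    ),                                                       # datetime
})

# Show the data type of each column
print(df.dtypes)
```

The `name` column requests the object data type explicitly so that the example behaves the same regardless of how a given pandas version infers strings.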

### The importance of knowing the data type

Knowing the data type of each column of your pandas DataFrame is very important, mainly because every value in a column has the same type. For instance, if you select a single value from a column that has an integer data type, you are guaranteed that this value is also an integer. The data type of a column is one of the most fundamental things to know about your DataFrame.
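A minimal sketch of this guarantee, using a made-up integer Series standing in for a DataFrame column. Note that the selected value is a NumPy integer rather than a built-in Python `int`, but it still behaves as an integer.

```python
import numbers

import numpy as np
import pandas as pd

trips = pd.Series([3, 12, 7])   # hypothetical integer column
value = trips.iloc[0]           # select a single value by position

print(trips.dtype)                            # an integer data type
print(type(value))                            # a NumPy integer type
print(isinstance(value, numbers.Integral))    # it counts as an integer
```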

### The exception with the object data type

The object data type is the most confusing and deserves a longer discussion. It is the exception to the guarantee described in the previous section: each value in an object column can be any Python object. Object columns can contain integers, floats, or even data structures such as lists or dictionaries. Nearly all of the time, however, columns with the object data type contain only strings, so when you see an object column, you should expect its values to be strings. If your column values are strings, the data type will be object, but an object data type does not guarantee that every value is a string.
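To make this concrete, here is a small sketch with an invented Series that mixes several kinds of Python objects. pandas stores it with the object data type, and each value keeps its own Python type.

```python
import pandas as pd

# Hypothetical values chosen only to show that anything fits in an object column
mixed = pd.Series(["text", 1, 2.5, [1, 2], {"a": 1}])

print(mixed.dtype)

# Inspect the Python type of every individual value
types = [type(v).__name__ for v in mixed]
print(types)
```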

## String data type - major enhancement in pandas 1.0

Before the release of pandas version 1.0, there was no dedicated string data type. This was a huge limitation and caused numerous problems. pandas still has the 'object' data type, which is capable of holding strings.

With the addition of the string data type, we are guaranteed that every value will be a string in a column with string data type. This new data type is still labeled as "experimental" in the pandas documentation, so I do not suggest using it for serious work yet. There are many bugs that need to be fixed and behavior sorted out before it is ready to use. Until then, this book will continue to use the object data type for columns containing strings.
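For reference only, this is roughly how the string data type can be requested, assuming pandas 1.0 or later. The names below are invented for the example, and the rest of this book does not rely on this data type.

```python
import pandas as pd

# Ask for the dedicated string data type by name
names = pd.Series(["Ana", "Ben", None], dtype="string")

print(names.dtype)
print(isinstance(names.dtype, pd.StringDtype))
print(names.isna().tolist())   # the missing entry is still detected
```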

## Missing value representation

Datasets often have missing values, which need some representation to identify them. pandas uses the objects `NaN` and `NaT` to represent them.

* `NaN` - "Not a Number"
* `NaT` - "Not a Time"
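A short sketch of how these two objects behave on their own. It assumes that NaN is accessed through NumPy as `np.nan` and NaT through pandas as `pd.NaT`, and it uses the `pd.isna` function to detect them.

```python
import numpy as np
import pandas as pd

print(type(np.nan))       # NaN is an ordinary Python float
print(np.nan == np.nan)   # NaN never compares equal, not even to itself

# Use pd.isna rather than == to test for missing values
print(pd.isna(np.nan))
print(pd.isna(pd.NaT))
```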

### Missing values for each data type

The missing value representation depends on the data type of the column. Our common data types use the following representations.

* **boolean** - No missing value representation
* **integer** - No missing value representation
* **float** - `NaN`
* **object** - `NaN`
* **datetime** - `NaT`
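The sketch below shows the three data types from this list that can hold missing values. The Series are made up for illustration, and the object Series requests its data type explicitly so that the result does not depend on how a pandas version infers strings.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan, 3.0])
strings = pd.Series(["a", np.nan, "c"], dtype=object)
dates = pd.Series(pd.to_datetime(["2020-01-01", None, "2020-01-03"]))

# Count the missing values in each Series
print(floats.isna().sum())
print(strings.isna().sum())
print(dates.isna().sum())

# The missing datetime is represented by NaT
print(dates.iloc[1] is pd.NaT)
```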

### Missing values in boolean and integer columns

Knowing that a column has a boolean or integer data type guarantees that it contains no missing values, as pandas does not allow them in these columns. If a missing value is placed in such a column, pandas converts the entire column to a data type that can accommodate missing values: an integer column becomes float, and a boolean column usually becomes object. When booleans are converted to floats, False becomes 0 and True becomes 1.
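One way to see the integer conversion is to introduce a missing value by reindexing, as in this sketch. The Series is invented for the example, and reindexing is only one of several operations that can introduce missing values.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])

# Label 3 does not exist, so reindexing creates a missing value there
with_missing = ints.reindex([0, 1, 2, 3])

print(ints.dtype)              # an integer data type
print(with_missing.dtype)      # float64
print(with_missing.tolist())

# Converting booleans to floats maps True to 1.0 and False to 0.0
bool_floats = pd.Series([True, False]).astype(float).tolist()
print(bool_floats)
```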

## New integer and boolean data types in pandas 1.0

Two new data types, the **nullable integer** and **nullable boolean**, are now available in pandas 1.0. These are completely different data types from the original integer and boolean data types and have slightly different behavior. The main difference is that they do have a missing value representation.


### pandas NA - A new missing value representation for pandas 1.0

Previously, pandas relied on the NumPy library to supply its primary missing value, NaN, which continues to exist. With the release of version 1.0, pandas created its own missing value representation, NA. This is a new and experimental addition, so its behavior can change.
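For the curious, the following sketch shows how the nullable data types and NA fit together, assuming pandas 1.0 or later. The data type names `"Int64"` (with a capital I) and `"boolean"` select the nullable versions, and the values are made up for the example.

```python
import pandas as pd

# "Int64" with a capital I is the nullable integer, distinct from NumPy's int64
ints = pd.Series([1, 2, None], dtype="Int64")
bools = pd.Series([True, None, False], dtype="boolean")

print(ints.dtype, bools.dtype)

# The missing entries are represented by pd.NA and the columns keep their data type
print(ints.iloc[2] is pd.NA)
print(bools.iloc[1] is pd.NA)
print(ints.isna().tolist())
```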

## Recommendation for pandas 1.0 - Avoid the new data types

I recommend not using the new string, nullable integer, and nullable boolean data types along with the pandas NA until there has been more development with them. They are still experimental and their behavior can change. I've personally found several bugs and strange behavior using them and would wait until they are more stable. There will be a chapter dedicated to these new data types in part **05. Data Types** with more information.

## Finding the data type of each column

The `dtypes` DataFrame attribute (NOT a method) returns the data type of each column and is one of the first commands you should execute after reading in your data. Let's begin by using the `read_csv` function to read in the bikes dataset. 

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

Let's get the data types of each column in our `bikes` DataFrame. This returns a Series object with the data types as the values and the column names as the index.

In [None]:
bikes.dtypes

### Object data types hold string columns

By default, pandas reads in columns containing strings as the object data type. When you see object as the data type, you should think "string".

### The `starttime` and `stoptime` columns are not datetimes

From the visual display of the bikes DataFrame above, it appears that both the `starttime` and `stoptime` columns are datetimes. However, the result of the `dtypes` attribute shows that they are strings. Unfortunately, the `read_csv` function does not automatically read in these columns as datetimes. You must pass a list of the datetime columns to the `parse_dates` parameter; otherwise, they are read in as strings. Let's reread the data using the `parse_dates` parameter.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.dtypes.head()
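
If you do not have the bikes file at hand, the effect of `parse_dates` can be reproduced with a tiny in-memory CSV, as sketched below. The two rows and the `tripid` column are invented stand-ins, not the real data, and `io.StringIO` is used only so that `read_csv` has something file-like to read.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical two-row CSV standing in for the real file
csv_text = (
    "tripid,starttime,stoptime\n"
    "1,2013-06-28 19:01:00,2013-06-28 19:17:00\n"
    "2,2013-06-28 22:53:00,2013-06-28 23:03:00\n"
)

# Without parse_dates the columns are read in as strings
as_strings = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the listed columns become datetimes
as_datetimes = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["starttime", "stoptime"]
)

print(is_datetime64_any_dtype(as_strings["starttime"]))
print(is_datetime64_any_dtype(as_datetimes["starttime"]))
```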

### What are all those 64's at the end of the data types?

Booleans, integers, floats, and datetimes all use a specific amount of memory for each value. The memory is measured in **bits**. The number of bits used for each value is the number appended to the end of the data type name. For instance, integers can be 8, 16, 32, or 64 bits, while floats can be 16, 32, 64, or, on platforms that support it, 128 bits. A 128-bit float column will show up as `float128`.

Technically, a `float128` is a different data type from a `float64`, but you generally will not have to worry about the distinction, as the same operations are available on float columns of any size. Booleans are always stored as 8 bits. There is no set bit size for object columns, as each value can be of any size.
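The bit sizes can be checked directly, as in this sketch. It relies on the `itemsize` attribute of a NumPy data type, which reports the number of bytes per value, and on `np.iinfo`, which reports the range of an integer type. The Series are made up for the example.

```python
import numpy as np
import pandas as pd

small = pd.Series([1, 2, 3], dtype="int8")
large = pd.Series([1, 2, 3], dtype="int64")

# itemsize is in bytes, so multiply by 8 to get bits
small_bits = small.dtype.itemsize * 8
large_bits = large.dtype.itemsize * 8
bool_bits = np.dtype(bool).itemsize * 8
print(small_bits, large_bits, bool_bits)

# Fewer bits means a narrower range of values that can be stored
int8_range = (np.iinfo("int8").min, np.iinfo("int8").max)
print(int8_range)
```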

## Getting more metadata

**Metadata** can be defined as data about the data. The data type of each column is an example of metadata. The number of rows and columns is another piece of metadata. We find this with the `shape` attribute, which returns a tuple of integers representing the number of rows and columns of the DataFrame.

In [None]:
bikes.shape

### Use the `len` function to get the number of rows

Pass the DataFrame to the built-in `len` function to return the number of rows as an integer.

In [None]:
len(bikes)

You can also get the number of rows as an integer by selecting the first item of the tuple returned by `shape`. Either way is acceptable.

In [None]:
bikes.shape[0]

Similarly, you can get the number of columns as an integer by selecting the second item.

In [None]:
bikes.shape[1]

### Total number of values with the `size` attribute

The `size` attribute returns the total number of values (the number of columns multiplied by the number of rows) in the DataFrame.

In [None]:
bikes.size
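
The relationship between `shape`, `len`, and `size` can be verified on any DataFrame. The sketch below uses a small made-up DataFrame with three rows and two columns rather than the bikes data.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the shape tuple into rows and columns
n_rows, n_cols = df.shape
print(df.shape)

# len gives the number of rows, size gives rows multiplied by columns
print(len(df) == n_rows)
print(df.size == n_rows * n_cols)
```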

### Get data types plus more with the `info` method

The `info` DataFrame method provides output similar to `dtypes`, but also shows the number of non-missing values in each column. Its full output includes:

* Type of object (always a DataFrame)
* The type of index and number of rows
* The number of columns
* The data type of each column and its number of non-missing (a.k.a. non-null) values
* The frequency count of all data types
* The total memory usage

The information is printed to the screen. It does not return any object.

In [None]:
bikes.info()
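
Because `info` prints instead of returning a value, assigning its result gives you `None`. The sketch below shows this on a small made-up DataFrame and, as one possible workaround, uses the `buf` parameter to capture the printed text in an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical DataFrame with one missing value in the distance column
df = pd.DataFrame({"trips": [3, 12, 7], "distance": [1.5, None, 8.0]})

result = df.info()      # prints the summary to the screen
print(result is None)   # nothing is returned

# Redirect the printed summary into a string buffer instead of the screen
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```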

## More data types

There are many more data types available in pandas. An extensive and formal discussion of all data types is available in part **05. Data Types**.

## Exercises
Use the `bikes` DataFrame for the following:

### Exercise 1
<span  style="color:green; font-size:16px">What type of object is returned from the `dtypes` attribute?</span>

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is returned from the `shape` attribute?</span>

### Exercise 3

<span  style="color:green; font-size:16px">The memory usage from the `info` method isn't correct when you have objects in your DataFrame. Read its docstring and get the true memory usage.</span>