# Column Type Transforms


When consuming a data set, it is highly useful to know as much as possible about the data. Column types can help you understand more about each column, and enable type-specific transformations later. This provides much more insight than treating all data as strings.

In this notebook, you will learn about:
- [Built-in column types](#types)
- How to:
  - [Convert to long (integer)](#long)
  - [Convert to double (floating point or decimal number)](#double)
  - [Convert to boolean](#boolean)
  - [Convert to datetime](#datetime)
- [How to use `ColumnTypesBuilder` to get suggested column types and convert them](#builder)
- [How to convert column types for multiple columns when the types are known](#multiple-columns)

## Set up

In [None]:
import azureml.dataprep as dprep

In [None]:
dflow = dprep.read_csv('../data/crime-winter.csv')
dflow = dflow.keep_columns(['Case Number', 'Date', 'IUCR', 'Arrest', 'Longitude', 'Latitude'])

<a id="types"></a>

## Built-in column types

Currently, Data Prep supports the following column types: string, long (integer), double (floating point or decimal number), boolean, and datetime.

In the previous step, a data set was read in as a Dataflow, with only a few interesting columns kept. We will use this Dataflow to explore column types throughout the notebook.

In [None]:
dflow.head(5)

From the first few rows of the Dataflow, you can see that the columns contain different types of data. However, by looking at `dtypes`, you can see that `read_csv()` treats all columns as string columns.

Note that `auto_read_file()` is a data ingestion function that infers column types. Learn more about it [here](./auto-read-file.ipynb).

In [None]:
dflow.dtypes
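
CSV is a plain-text format that carries no type information, so a reader that does not infer types can only return strings. The following minimal sketch uses Python's standard `csv` module rather than Data Prep, and the sample rows are made up for illustration:

```python
import csv
import io

# Hypothetical two-row sample shaped like some of the columns kept above.
sample = io.StringIO(
    "Case Number,Arrest,Latitude\n"
    "HZ100001,FALSE,41.88\n"
    "HZ100002,TRUE,\n"
)

rows = list(csv.DictReader(sample))

# Every parsed value is a str, including the numeric-looking ones
# and the empty Latitude in the second row.
value_types = {type(value).__name__ for row in rows for value in row.values()}
print(value_types)  # {'str'}
```

The conversions in the rest of this notebook are what turn such strings into typed values.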

<a id="long"></a>

### Converting to long (integer)

Suppose the "IUCR" column should only contain integers. You can call `to_long` to convert the column type of "IUCR" to `FieldType.INTEGER`. If you look at the data profile ([learn more about data profiles](./data-profile.ipynb)), you will see numeric metrics populated for that column such as mean, variance, quantiles, etc. This is helpful for understanding the shape and distribution of numeric data.

In [None]:
dflow_conversion = dflow.to_long('IUCR')
profile = dflow_conversion.get_profile()
profile
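
To see why numeric metrics only become meaningful after conversion, here is a small plain-Python sketch using the standard `statistics` module. It does not use Data Prep, and the code values are invented for illustration:

```python
import statistics

# Hypothetical raw values as they would arrive from a CSV reader: all strings.
raw_codes = ["110", "486", "820", "1310"]

# Strings compare character by character, so the "largest" string
# is not the largest number.
largest_as_text = max(raw_codes)

# After conversion to integers, numeric metrics can be computed directly.
codes = [int(value) for value in raw_codes]
largest_as_number = max(codes)
mean = statistics.mean(codes)
median = statistics.median(codes)
variance = statistics.variance(codes)  # sample variance

print(largest_as_text, largest_as_number)  # 820 1310
```

The same reasoning applies to the mean, variance, and quantiles shown in the data profile: they are only defined once the column holds numbers.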

<a id="double"></a>

### Converting to double (floating point or decimal number)

Suppose the "Latitude" and "Longitude" columns should only contain decimal numbers. You can call `to_number` to convert the column type of both columns to `FieldType.DECIMAL`. In the data profile, you will see numeric metrics populated for these columns as well. After the conversion, the profile also shows that these columns contain missing values. Metrics like this can help you notice issues with the data set.

In [None]:
dflow_conversion = dflow_conversion.to_number(['Latitude', 'Longitude'])
profile = dflow_conversion.get_profile()
profile

<a id="boolean"></a>

### Converting to boolean

Suppose the "Arrest" column should only contain boolean values. You can call `to_bool` to convert the column type of "Arrest" to `FieldType.BOOLEAN`.

The `to_bool` function allows you to specify which values should map to `True` and which values should map to `False`. To do so, you can provide those values in an array as parameters `true_values` and `false_values`. Additionally, you can specify whether all other values should become `True`, `False` or Error by using the `mismatch_as` parameter.

In [None]:
dflow_conversion.to_bool('Arrest', 
                         true_values=[1],
                         false_values=[0],
                         mismatch_as=dprep.MismatchAsOption.ASERROR).head(5)

In the previous conversion, all the values in the "Arrest" column became `DataPrepError`, because 'FALSE' matched none of the `true_values` or `false_values`, and `mismatch_as` was set to turn unmatched values into errors. Let's try the conversion again with different `true_values` and `false_values`.

In [None]:
dflow_conversion = dflow_conversion.to_bool('Arrest',
                                            true_values=['1', 'TRUE'],
                                            false_values=['0', 'FALSE'],
                                            mismatch_as=dprep.MismatchAsOption.ASERROR)
dflow_conversion.head(5)

This time, all the string values 'FALSE' have been successfully converted to the boolean value `False`. Take another look at the data profile.

In [None]:
profile = dflow_conversion.get_profile()
profile
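
The matching rule described in this section can be sketched in plain Python. The helper `to_bool_sketch` and the `ConversionError` class are hypothetical names introduced here for illustration. They are not part of Data Prep, and the exact comparison Data Prep performs is an assumption:

```python
class ConversionError:
    """Stand-in for an error value placed in a cell."""

    def __init__(self, original):
        self.original = original

    def __repr__(self):
        return f"ConversionError({self.original!r})"


def to_bool_sketch(value, true_values, false_values, mismatch_as="error"):
    """Map a raw value to True/False, or handle it as a mismatch."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    # The value matched neither list: apply the mismatch policy.
    if mismatch_as == "true":
        return True
    if mismatch_as == "false":
        return False
    return ConversionError(value)


column = ["FALSE", "TRUE"]  # made-up string values, as read from CSV

# First attempt: neither list contains the strings found in the column.
first_attempt = [to_bool_sketch(v, [1], [0]) for v in column]

# Second attempt: the lists contain the strings found in the column.
second_attempt = [to_bool_sketch(v, ["1", "TRUE"], ["0", "FALSE"]) for v in column]

print(first_attempt)   # [ConversionError('FALSE'), ConversionError('TRUE')]
print(second_attempt)  # [False, True]
```

Choosing the error policy, as the cells above do, makes a wrong value list visible immediately instead of silently mapping every row to one boolean value.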

<a id="datetime"></a>

### Converting to datetime

Suppose the "Date" column should only contain datetime values. You can convert its column type to `FieldType.DATE` using the `to_datetime` function. Datetime formats are often confusing or inconsistent, so the examples below show the tools that help you convert the column to datetime correctly.

In the first example, directly call `to_datetime` with only the column name. Data Prep will inspect the data in this column and learn what format should be used for the conversion.

Note that if there is data in the column that cannot be converted to datetime, an Error value will be created in that cell.

In [None]:
dflow_conversion_date = dflow_conversion.to_datetime('Date')
dflow_conversion_date.head(5)

In this case, we can see that '1/10/2016 11:00' was converted using the format `%m/%d/%Y %H:%M`.

The data in this column is actually somewhat ambiguous. Should the dates be 'October 1' or 'January 10'? The function `to_datetime` determines that both are possible, but defaults to month-first (US format).

If the data was supposed to be day-first, you can customize the conversion.

In [None]:
dflow_alternate_conversion = dflow_conversion.to_datetime('Date', date_time_formats=['%d/%m/%Y %H:%M'])
dflow_alternate_conversion.head(5)
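
The ambiguity is easy to reproduce outside Data Prep. This short sketch uses Python's standard `datetime.strptime` with the same two format strings to parse the value discussed above:

```python
from datetime import datetime

raw = "1/10/2016 11:00"

# The same text parses under both formats but yields different dates.
month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M")
day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M")

print(month_first.date().isoformat())  # 2016-01-10
print(day_first.date().isoformat())    # 2016-10-01
```

Both calls succeed, so nothing in the value itself indicates which reading is correct. That decision has to come from knowledge of the data source.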

<a id="builder"></a>

## Using `ColumnTypesBuilder`

Data Prep can automatically detect the likely type of each column.

You can call `dflow.builders.set_column_types()` to get a `ColumnTypesBuilder`. Then, calling `learn()` on it will trigger Data Prep to inspect the data in each column. As a result, you can see the suggested column types for each column (conversion candidates).

In [None]:
builder = dflow.builders.set_column_types()
builder.learn()
builder

In this case, Data Prep suggested the correct column types for "Arrest", "Case Number", "Latitude", and "Longitude".

However, for "Date", it has suggested two possible date formats: month-first, or day-first. The ambiguity must be resolved before you complete the conversion. To use the month-first format, you can call `builder.ambiguous_date_conversions_keep_month_day()`. Otherwise, call `builder.ambiguous_date_conversions_keep_day_month()`. Note that if there were multiple datetime columns with ambiguous date conversions, calling one of these functions will apply the resolution to all of them.

If you want to skip all the ambiguous date column conversions instead, you can call: `builder.ambiguous_date_conversions_drop()`

In [None]:
builder.ambiguous_date_conversions_keep_month_day()
builder.conversion_candidates
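
One way to reason about when a date column is ambiguous is to check which formats successfully parse every value in it. The sketch below is plain Python with made-up values and a hypothetical helper `candidate_formats`. It only illustrates the idea and makes no claim about how `ColumnTypesBuilder` works internally:

```python
from datetime import datetime

MONTH_FIRST = "%m/%d/%Y %H:%M"
DAY_FIRST = "%d/%m/%Y %H:%M"


def parses(value, fmt):
    """Return True if the value can be parsed with the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def candidate_formats(values):
    """Return the formats under which every value in the column parses."""
    return [fmt for fmt in (MONTH_FIRST, DAY_FIRST)
            if all(parses(value, fmt) for value in values)]


# Every value has day and month both <= 12, so both formats fit.
ambiguous_column = ["1/10/2016 11:00", "2/11/2016 09:30"]

# A single value with 25 in the second position rules out day-first.
resolved_column = ambiguous_column + ["1/25/2016 08:15"]

print(candidate_formats(ambiguous_column))  # ['%m/%d/%Y %H:%M', '%d/%m/%Y %H:%M']
print(candidate_formats(resolved_column))   # ['%m/%d/%Y %H:%M']
```

When more than one format survives, as in the first column, a choice such as month-first or day-first must be made explicitly.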

The conversion candidate for "IUCR" is currently `FieldType.INTEGER`. If you know that "IUCR" should be floating point (called `FieldType.DECIMAL`), you can tweak the builder to change the conversion candidate for that specific column. 

In [None]:
builder.conversion_candidates['IUCR'] = dprep.FieldType.DECIMAL
builder

In this case, `FieldType.INTEGER` is the right type for "IUCR", so set it back.

In [None]:
builder.conversion_candidates['IUCR'] = dprep.FieldType.INTEGER
builder

Once you are happy with the conversion candidates, you can complete the conversion by calling `builder.to_dataflow()`.

In [None]:
dflow_conversion_using_builder = builder.to_dataflow()
dflow_conversion_using_builder.head(5)

<a id="multiple-columns"></a>

## Convert column types for multiple columns

If you already know the column types, you can simply call `dflow.set_column_types()`. This function allows you to specify multiple columns, and the desired column type for each one. Here's how you can convert all five columns at once.

Note that `set_column_types` only supports a subset of column type conversions. For example, you cannot specify the true/false values for a boolean conversion, so the result of this operation is incorrect for the "Arrest" column.

In [None]:
dflow_conversion_using_set = dflow.set_column_types({
    'IUCR': dprep.FieldType.INTEGER,
    'Latitude': dprep.FieldType.DECIMAL,
    'Longitude': dprep.FieldType.DECIMAL,
    'Arrest': dprep.FieldType.BOOLEAN,
    'Date': (dprep.FieldType.DATE, ['%m/%d/%Y %H:%M']),
})
dflow_conversion_using_set.head(5)
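
The idea of a single mapping from column name to target type can be sketched in plain Python. The `converters` dictionary and the sample row below are hypothetical and are not Data Prep's implementation. Note that the "Arrest" entry needs an explicit string-to-boolean mapping, which is the kind of detail `set_column_types` does not let you specify:

```python
from datetime import datetime

# Hypothetical converters keyed by column name, mirroring the shape of the
# mapping passed to set_column_types above.
converters = {
    "IUCR": int,
    "Latitude": float,
    "Longitude": float,
    # Booleans need an explicit mapping from the strings found in the data.
    "Arrest": lambda text: {"TRUE": True, "FALSE": False}[text],
    "Date": lambda text: datetime.strptime(text, "%m/%d/%Y %H:%M"),
}

# Made-up row in which every value is still a string.
row = {
    "Case Number": "HZ100001",
    "Date": "1/10/2016 11:00",
    "IUCR": "820",
    "Arrest": "FALSE",
    "Latitude": "41.88",
    "Longitude": "-87.63",
}

# Columns without a converter (here "Case Number") stay as strings.
typed_row = {name: converters.get(name, str)(value) for name, value in row.items()}

print(typed_row["Arrest"], typed_row["IUCR"])  # False 820
```

In Data Prep itself, one option is to leave "Arrest" out of the `set_column_types` mapping and convert it with `to_bool`, as shown in the boolean section.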