# Column Manipulations
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Azure ML Data Prep has many methods for manipulating columns, including basic create, update, and delete (CUD) operations and several more complex manipulations.

This notebook will focus primarily on data-agnostic operations. For all other column manipulation operations, we will link to their specific how-to guide.

## Table of Contents
[ColumnSelector](#ColumnSelector)<br>
[add_column](#add_column)<br>
[append_columns](#append_columns)<br>
[drop_columns](#drop_columns)<br>
[duplicate_column](#duplicate_column)<br>
[fuzzy_group_column](#fuzzy_group_column)<br>
[keep_columns](#keep_columns)<br>
[map_column](#map_column)<br>
[new_script_column](#new_script_column)<br>
[rename_columns](#rename_columns)<br>


<a id="ColumnSelector"></a>

## ColumnSelector
`ColumnSelector` is a Data Prep class that allows us to select columns by name. The idea is to describe columns generally instead of explicitly, using a search term or regular expression, with various options.

Note that a `ColumnSelector` does not represent the matched columns themselves; it only describes how to select them. Therefore, if we use the same `ColumnSelector` on two different dataflows, we may get different results depending on the columns of each dataflow.

Column manipulations that can utilize `ColumnSelector` will be noted in their respective sections in this notebook.
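
To make the selector options concrete, here is a minimal pure-Python sketch of the matching behaviour described in this section. It is not Data Prep's implementation: `select_columns` is a hypothetical helper, the column lists are illustrative subsets, and it assumes that "whole word" means the entire column name must match.

```python
import re

def select_columns(names, term, use_regex=False, ignore_case=False,
                   match_whole_word=False, invert=False):
    """Return the names an equivalent selector would pick (illustrative model only)."""
    # A plain search term is matched literally, so escape regex metacharacters.
    pattern = term if use_regex else re.escape(term)
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    # Assumption: whole-word matching requires the full column name to match.
    matcher = compiled.fullmatch if match_whole_word else compiled.search
    # With invert=True, keep exactly the names that did NOT match.
    return [name for name in names if bool(matcher(name)) != invert]

# Two hypothetical sets of column names standing in for two dataflows.
crime_columns = ["ID", "Date", "IUCR", "Primary Type", "Description"]
other_columns = ["name", "city", "zip"]

# The same selector description yields different columns on each "dataflow".
print(select_columns(crime_columns, term="i", ignore_case=True))
# ['ID', 'IUCR', 'Primary Type', 'Description']
print(select_columns(other_columns, term="i", ignore_case=True))
# ['city', 'zip']
```

The selector is only a description: nothing is resolved until it is applied to a concrete list of column names.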

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

All parameters to a `ColumnSelector` are shown here for completeness. We will use `keep_columns` in our example, which will keep only the columns in the dataflow that we tell it to keep.

In the example below, we match all columns with the letter 'i'. Because we set `ignore_case` to true and `match_whole_word` to false, any column that contains 'i' or 'I' will be selected.

In [None]:
from azureml.dataprep import ColumnSelector
column_selector = ColumnSelector(term="i",
                                 use_regex=False,
                                 ignore_case=True,
                                 match_whole_word=False,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

If we set `invert` to true, we get the opposite of what we matched earlier.

In [None]:
column_selector = ColumnSelector(term="i",
                                 use_regex=False,
                                 ignore_case=True,
                                 match_whole_word=False,
                                 invert=True)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

If we change the search term to 'I' and set `ignore_case` to false, the match becomes case sensitive and we get only the handful of columns that contain an uppercase 'I'.

In [None]:
column_selector = ColumnSelector(term="I",
                                 use_regex=False,
                                 ignore_case=False,
                                 match_whole_word=False,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

And if we set `match_whole_word` to true, we get no results at all as there is no column called 'I'.

In [None]:
column_selector = ColumnSelector(term="I",
                                 use_regex=False,
                                 ignore_case=False,
                                 match_whole_word=True,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

Finally, the `use_regex` flag dictates whether to treat the search term as a regular expression. It can still be combined with the other options.

Here we select all columns whose names begin with the letter 'I'. Because `ignore_case` is set to true, a name beginning with a lowercase 'i' would match as well.

In [None]:
column_selector = ColumnSelector(term="I.*",
                                 use_regex=True,
                                 ignore_case=True,
                                 match_whole_word=True,
                                 invert=False)
dflow_selected = dflow.keep_columns(column_selector)
dflow_selected.head(5)

<a id="add_column"></a>

## add_column

Please see [add-column-using-expression](add-column-using-expression.ipynb).

<a id="append_columns"></a>

## append_columns

Please see [append-columns-and-rows](append-columns-and-rows.ipynb).

<a id="drop_columns"></a>

## drop_columns

Data Prep supports dropping one or more columns in a single statement. `drop_columns` also accepts a `ColumnSelector`.

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Note that there are 22 columns to begin with. We will now drop the 'ID' column and observe that the resulting dataflow contains 21 columns.

In [None]:
dflow_dropped = dflow.drop_columns('ID')
dflow_dropped.head(5)

We can also drop more than one column at once by passing a list of column names.

In [None]:
dflow_dropped = dflow_dropped.drop_columns(['IUCR', 'Description'])
dflow_dropped.head(5)

<a id="duplicate_column"></a>

## duplicate_column

Data Prep supports duplicating one or more columns in a single statement.

Duplicated columns are placed to the immediate right of their source column.

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

We decide which column(s) to duplicate and what the new column name(s) should be by passing a dictionary of key-value pairs, where each key is a source column and each value is the name of its copy.

In [None]:
dflow_dupe = dflow.duplicate_column({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})
dflow_dupe.head(5)
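
The placement rule stated above (each copy sits immediately to the right of its source) can be illustrated with a short pure-Python sketch that works on column names only. `duplicated_order` is a hypothetical helper and the column list is an illustrative subset; this models the resulting column order, not how Data Prep performs the duplication.

```python
def duplicated_order(columns, column_pairs):
    """Return the column order after duplication (illustrative model only)."""
    result = []
    for name in columns:
        result.append(name)
        # Insert the copy directly after its source column.
        if name in column_pairs:
            result.append(column_pairs[name])
    return result

# Hypothetical subset of the column names used in this notebook.
columns = ["ID", "Date", "IUCR", "Primary Type", "Description"]
print(duplicated_order(columns, {"ID": "ID2", "IUCR": "IUCR_Clone"}))
# ['ID', 'ID2', 'Date', 'IUCR', 'IUCR_Clone', 'Primary Type', 'Description']
```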

<a id="fuzzy_group_column"></a>

## fuzzy_group_column

Please see [fuzzy-group](fuzzy-group.ipynb).

<a id="keep_columns"></a>

## keep_columns

Data Prep supports keeping one or more columns in a single statement. The resulting dataflow will contain only the column(s) specified; all other columns are dropped. `keep_columns` also accepts a `ColumnSelector`.

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

In [None]:
dflow_keep = dflow.keep_columns(['ID', 'Date', 'Description'])
dflow_keep.head(5)

Similar to `drop_columns`, we can pass a single column name or a list of them.

In [None]:
dflow_keep = dflow_keep.keep_columns('ID')
dflow_keep.head(5)
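
Keeping and dropping are complementary, and both accept either a single name or a list of names. The sketch below models that on plain lists of column names. The helpers `keep`, `drop`, and `as_list` are hypothetical, and the sketch assumes the original column order is preserved, which the text above does not state.

```python
def as_list(columns):
    """Wrap a single column name in a list; pass lists through unchanged."""
    return [columns] if isinstance(columns, str) else list(columns)

def keep(names, columns):
    """Model of keeping columns: retain only the requested names."""
    wanted = set(as_list(columns))
    return [name for name in names if name in wanted]

def drop(names, columns):
    """Model of dropping columns: retain everything except the requested names."""
    unwanted = set(as_list(columns))
    return [name for name in names if name not in unwanted]

# Hypothetical subset of the column names used in this notebook.
names = ["ID", "Date", "IUCR", "Primary Type", "Description"]

kept = keep(names, ["ID", "Date", "Description"])
dropped = drop(names, ["ID", "Date", "Description"])
print(kept)     # ['ID', 'Date', 'Description']
print(dropped)  # ['IUCR', 'Primary Type']
```

Every column lands in exactly one of the two results, which is why either operation can be used to reach the same final set of columns.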

<a id="map_column"></a>

## map_column

Data Prep supports string mapping. For a column containing strings, we can provide specific mappings from an original value to a new value, and then produce a new column that contains the mapped values.

The mapped columns are placed to the immediate right of their source column.

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

In [None]:
from azureml.dataprep import ReplacementsValue
replacements = [ReplacementsValue('THEFT', 'THEFT2'), ReplacementsValue('BATTERY', 'BATTERY!!!')]
dflow_mapped = dflow.map_column(column='Primary Type', 
                                new_column_id='Primary Type V2',
                                replacements=replacements)
dflow_mapped.head(5)
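
As a rough model of what the mapping produces, the sketch below applies the same replacements to a few hand-written rows. `map_column_sketch` and the row values are hypothetical, and the sketch assumes that values without a replacement are copied to the new column unchanged; check the Data Prep documentation before relying on that behaviour.

```python
def map_column_sketch(rows, column, new_column_id, replacements):
    """Add a mapped copy of `column` to each row (illustrative model only)."""
    mapped_rows = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            new_row[key] = value
            if key == column:
                # Place the mapped value immediately to the right of its source.
                # Assumption: values with no replacement pass through unchanged.
                new_row[new_column_id] = replacements.get(value, value)
        mapped_rows.append(new_row)
    return mapped_rows

# Hypothetical rows; the real data comes from crime-dirty.csv.
rows = [
    {"ID": 1, "Primary Type": "THEFT", "Description": "example a"},
    {"ID": 2, "Primary Type": "BATTERY", "Description": "example b"},
    {"ID": 3, "Primary Type": "ASSAULT", "Description": "example c"},
]
mapped = map_column_sketch(rows, "Primary Type", "Primary Type V2",
                           {"THEFT": "THEFT2", "BATTERY": "BATTERY!!!"})
for row in mapped:
    print(row["Primary Type"], "->", row["Primary Type V2"])
```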

<a id="new_script_column"></a>

## new_script_column

Please see [custom-python-transforms](custom-python-transforms.ipynb).

<a id="rename_columns"></a>

## rename_columns

Data Prep supports renaming one or more columns in a single statement.

In [None]:
from azureml.dataprep import auto_read_file
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

We decide which column(s) to rename and what the new column name(s) should be by passing a dictionary of key-value pairs, where each key is an existing column name and each value is its new name.

In [None]:
dflow_renamed = dflow.rename_columns({'ID': 'ID2', 'IUCR': 'IUCR_Clone'})
dflow_renamed.head(5)
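
Unlike duplication, renaming replaces a name rather than adding a column, so the column count stays the same. The sketch below models this on a plain list of names. `renamed_order` is a hypothetical helper, and it assumes a renamed column keeps its original position.

```python
def renamed_order(columns, column_pairs):
    """Return column names after renaming (illustrative model only)."""
    # Names without an entry in the dictionary are left as they are.
    return [column_pairs.get(name, name) for name in columns]

# Hypothetical subset of the column names used in this notebook.
columns = ["ID", "Date", "IUCR", "Primary Type", "Description"]
print(renamed_order(columns, {"ID": "ID2", "IUCR": "IUCR_Clone"}))
# ['ID2', 'Date', 'IUCR_Clone', 'Primary Type', 'Description']
```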