
# Filtering
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Azure ML Data Prep can drop columns with `Dataflow.drop_columns` and filter rows with `Dataflow.filter`.

In [None]:
# initial set up
import azureml.dataprep as dprep
from datetime import datetime
dflow = dprep.read_csv(path='../data/crime-spring.csv')
dflow.head(5)

## Filtering columns

To filter columns, use `Dataflow.drop_columns`. This method takes a list of columns to drop or a more complex argument called `ColumnSelector`.

### Filtering columns with list of strings

In this example, `drop_columns` takes a list of strings. Each string must exactly match the name of a column to drop.

In [None]:
dflow = dflow.drop_columns(['ID', 'Location Description', 'Ward', 'Community Area', 'FBI Code'])
dflow.head(5)

### Filtering columns with regex

Alternatively, a `ColumnSelector` can be used to drop columns that match a regular expression. In this example, we drop all the columns that match the expression `Column*|.*longitud|.*latitude`.

In [None]:
dflow = dflow.drop_columns(dprep.ColumnSelector('Column*|.*longitud|.*latitude', True, True))
dflow.head(5)
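To see which names a pattern like this selects, the sketch below applies the same expression with Python's standard `re` module. It assumes the selector matches case-insensitively anywhere in the column name, which may not mirror `ColumnSelector` exactly. The column names are hypothetical samples, not read from `crime-spring.csv`.

```python
import re

# Same pattern as above. Case-insensitive search semantics are an assumption
# made for this illustration, not confirmed ColumnSelector behavior.
pattern = re.compile(r'Column*|.*longitud|.*latitude', re.IGNORECASE)

# Hypothetical column names used only for this illustration.
sample_columns = ['Case Number', 'Date', 'Column1', 'Latitude', 'Longitude', 'Location']

dropped = [name for name in sample_columns if pattern.search(name)]
kept = [name for name in sample_columns if not pattern.search(name)]

print('dropped:', dropped)
print('kept:', kept)
```

Note that `Column*` is a regular expression rather than a glob: it means "Colum" followed by zero or more "n" characters. Under the search semantics assumed here it still selects names such as `Column1`.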

## Filtering rows

To filter rows, use `Dataflow.filter`. This method takes an `Expression` as an argument and returns a new dataflow containing only the rows for which the expression evaluates to `True`. Expressions are built by indexing the `Dataflow` with a column name (`dataflow['myColumn']`) and applying a comparison operator (`>`, `<`, `>=`, `<=`, `==`, `!=`).

### Filtering rows with simple expressions

To build a simple expression, index into the Dataflow with a column name (`dataflow['column_name']`) and compare the result against a value using one of the standard operators `>`, `<`, `>=`, `<=`, `==`, `!=`, for example `dataflow['District'] > 9`. Then pass the expression to `Dataflow.filter`.

In this example, `dataflow.filter(dataflow['District'] > 9)` returns a new dataflow with the rows in which the value of "District" is greater than 9.

*Note that "District" is first converted to numeric, which allows us to build an expression comparing it against other numeric values.*

In [None]:
dflow = dflow.to_number(['District'])
dflow = dflow.filter(dflow['District'] > 9)
dflow.head(5)
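The idea that a comparison on a column produces an expression object, rather than an immediate `True` or `False`, can be sketched with plain Python operator overloading. This is a toy model for intuition only, not how Data Prep is implemented; `ToyColumn`, `ToyExpression`, and `toy_filter` are hypothetical names invented for this illustration.

```python
import operator


class ToyExpression:
    """A deferred test that is evaluated against one row (a dict) at a time."""

    def __init__(self, column, op, value):
        self.column = column
        self.op = op
        self.value = value

    def evaluate(self, row):
        return self.op(row[self.column], self.value)


class ToyColumn:
    """Hypothetical stand-in for the object returned by indexing with a column name."""

    def __init__(self, name):
        self.name = name

    # Each comparison returns an expression object instead of a bool.
    def __gt__(self, value):
        return ToyExpression(self.name, operator.gt, value)

    def __le__(self, value):
        return ToyExpression(self.name, operator.le, value)


def toy_filter(rows, expression):
    """Keep only the rows for which the expression evaluates to True."""
    return [row for row in rows if expression.evaluate(row)]


# Hypothetical rows; the values are already numeric, as after a numeric conversion.
rows = [{'District': 4.0}, {'District': 9.0}, {'District': 12.0}]
expression = ToyColumn('District') > 9
filtered = toy_filter(rows, expression)
print(filtered)
```

Because the comparison is only recorded when the expression is built, the filter can evaluate it later, once per row.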

### Filtering rows with complex expressions

To filter using complex expressions, combine simple expressions with the operators `&` (and), `|` (or), and `~` (not). Note that in Python these operators bind more tightly than the comparison operators, so you need to wrap each comparison in parentheses.

In this example, `Dataflow.filter` returns a new dataflow with the rows in which "Primary Type" equals 'DECEPTIVE PRACTICE' and "District" is greater than or equal to 10.

In [None]:
dflow = dflow.to_number(['District'])
dflow = dflow.filter((dflow['Primary Type'] == 'DECEPTIVE PRACTICE') & (dflow['District'] >= 10))
dflow.head(5)
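The parentheses around each comparison are required by Python's own grammar, in which `&` and `|` bind more tightly than `==` or `>=`. The sketch below uses the standard `ast` module to show how Python groups the two spellings; the names `a` and `b` are placeholders and are never evaluated.

```python
import ast


def top_level_node(source):
    """Return the name of the outermost AST node of an expression."""
    return type(ast.parse(source, mode='eval').body).__name__


without_parens = "a == 'x' & b >= 10"
with_parens = "(a == 'x') & (b >= 10)"

# Without parentheses Python reads a chained comparison: a == ('x' & b) >= 10
print(top_level_node(without_parens))

# With parentheses the outermost operation is the & that combines both clauses
print(top_level_node(with_parens))
```

The first expression parses as a `Compare` node wrapped around `'x' & b`, which is not the intended filter, while the second parses as a `BinOp` joining the two comparisons.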

It is also possible to nest compound expressions inside one another to build a more complex filter.

*Note that `'Date'` and `'Updated On'` are first converted to datetime, which allows us to build an expression comparing them against other datetime values.*

In [None]:
dflow = dflow.to_datetime(['Date', 'Updated On'], ['%Y-%m-%d %H:%M:%S'])
dflow = dflow.to_number(['District', 'Y Coordinate'])
comparison_date = datetime(2016,4,13)
dflow = dflow.filter(
    ((dflow['Date'] > comparison_date) | (dflow['Updated On'] > comparison_date))
    | ((dflow['Y Coordinate'] > 1900000) & (dflow['District'] > 10.0)))
dflow.head(5)
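The format string used in the conversion above can be tried out with the standard library. This sketch assumes the Data Prep format directives behave like those of `datetime.strptime`, which the source does not confirm; the sample value is hypothetical.

```python
from datetime import datetime

date_format = '%Y-%m-%d %H:%M:%S'
raw_value = '2016-04-14 08:30:00'  # hypothetical sample, not taken from the dataset

# Parsing turns the text into a datetime that can be compared with other datetimes.
parsed = datetime.strptime(raw_value, date_format)
comparison_date = datetime(2016, 4, 13)
is_later = parsed > comparison_date
print(is_later)

# Comparing the unparsed string with a datetime is not supported in plain Python.
try:
    raw_value > comparison_date
    string_comparison_works = True
except TypeError:
    string_comparison_works = False
print(string_comparison_works)
```

This is the plain-Python analogue of why the columns are converted before the comparison is built.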