In [1]:
import pandas as pd
import numpy as np
np.random.seed(seed=42)

### Pandas Query With Categorical Variables

Pandas .query() is the other way to filter data: an alternative to boolean indexing that you may not usually reach for, but might want to consider.

In a previous video, we used .query() with continuous variables (e.g. 1.2, 3, 0.003). In this tutorial we will look at categorical variables (e.g. red, blue, green).

We will run through 2 examples:
1. When a categorical variable matches a single value
2. When a categorical variable is in a list of items

First, let's create our DataFrame.

In [7]:
df = pd.DataFrame.from_dict({"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
                             "Type": ['Bar', 'Restaurant', 'Bar', 'Restaurant'],
                             "Location": ['San Francisco', 'Los Angeles', 'New York', 'San Francisco']})
df

         Name        Type       Location
0   Liho Liho         Bar  San Francisco
1    Tompkins  Restaurant    Los Angeles
2  The Square         Bar       New York
3    Chambers  Restaurant  San Francisco


### 1. When a categorical variable matches a single value
The most confusing part about .query() is that you write the condition as a string, as if it were regular Python code. Below we query for establishments that have "Bar" in the 'Type' column.

Notice how we need to wrap Bar in quotes even though it is already inside a string. This is because pandas parses the outer string as an expression and evaluates it as if it were *not* a string. Without its own quotes, Bar would be read as a column or variable name rather than a text value.

In [8]:
df.query('Type == "Bar"')

         Name Type       Location
0   Liho Liho  Bar  San Francisco
2  The Square  Bar       New York


Notice that all the rows which *don't* satisfy this query are *not returned*.
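
The quoting rule above can be satisfied in more than one way. The sketch below rebuilds the same DataFrame and shows two alternatives: swapping the inner and outer quote styles, and referencing a local Python variable with the @ prefix. The variable name wanted_type is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
                   "Type": ['Bar', 'Restaurant', 'Bar', 'Restaurant'],
                   "Location": ['San Francisco', 'Los Angeles', 'New York', 'San Francisco']})

# Single quotes outside, double quotes inside (as in the cell above)
bars_a = df.query('Type == "Bar"')

# Double quotes outside, single quotes inside: same result
bars_b = df.query("Type == 'Bar'")

# Reference a local variable with @ so no inner quotes are needed
wanted_type = "Bar"  # hypothetical variable, not part of the original notebook
bars_c = df.query("Type == @wanted_type")

print(bars_c)
```

The @ form can be handy when the value comes from elsewhere in your code, since it avoids building the query string by hand.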

### 2. When a categorical variable is in a list of items
The same quoting rule applies when you use 'in' to check whether a variable 'is in' a list of items.

Below we are looking for any establishment that is in San Francisco or Los Angeles.

In [12]:
df.query('Location in ["San Francisco", "Los Angeles"]')

        Name        Type       Location
0  Liho Liho         Bar  San Francisco
1   Tompkins  Restaurant    Los Angeles
3   Chambers  Restaurant  San Francisco
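
As a small extension of the 'in' example, the sketch below shows the opposite check with 'not in', and passes the list in as a local variable using the @ prefix. The variable name cities is an assumption made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
                   "Type": ['Bar', 'Restaurant', 'Bar', 'Restaurant'],
                   "Location": ['San Francisco', 'Los Angeles', 'New York', 'San Francisco']})

# Rows whose Location is NOT one of the listed cities
elsewhere = df.query('Location not in ["San Francisco", "Los Angeles"]')

# The same list kept in a Python variable and referenced with @
cities = ["San Francisco", "Los Angeles"]  # hypothetical variable name
in_cities = df.query("Location in @cities")

print(elsewhere)
print(in_cities)
```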


When in doubt, first make sure that your conditional statement works when you're *not* using .query().

Like this!

In [14]:
df['Location'].isin(["San Francisco", "Los Angeles"])

0     True
1     True
2    False
3     True
Name: Location, dtype: bool
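
The boolean Series above is a mask, not yet a filtered DataFrame. A minimal sketch of the remaining step, assuming the same df as before: index the DataFrame with the mask, then compare the result against the .query() version to confirm that both approaches select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
                   "Type": ['Bar', 'Restaurant', 'Bar', 'Restaurant'],
                   "Location": ['San Francisco', 'Los Angeles', 'New York', 'San Francisco']})

# Build the boolean mask without .query()
mask = df['Location'].isin(["San Francisco", "Los Angeles"])

# Keep only the rows where the mask is True
filtered = df[mask]

# The same filter written with .query()
queried = df.query('Location in ["San Francisco", "Los Angeles"]')

# Check that both approaches return identical rows
print(filtered.equals(queried))
```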