### The most commonly used methods, such as positional indexing, label indexing, and slicing, will not be covered.

### Using brackets to filter values that meet a certain condition is a powerful feature of pandas. There are many ways to build such a condition, and I would like to start with the isin() method.

Note: each method heading links to its documentation in case more detail is needed.

In [1]:
import pandas as pd
df = pd.DataFrame({
    'month':	[1,2,3,4,5,2,3,4,5,1,2,3,4],
    'name':['Adam','Adam','Amia','Amia','Amia','Tim','Tim','Ike','Ike','Linda','Linda','Linda','Linda'],
    'gender':['male','male','female','female','female','male','female','male','male','female','female','female','female'],
    'salary':[2000,2000,2000,2000,2000,1800,1800,1800,1800,1800,1800,1800,1800],
    'bonus':[1560,1800,1550,1700,1600,1300,1300,1100,1330,1100,1320,1370,1330],
    '职位':['CEO','CEO','COO','COO','COO','director','director','director','director','manager','manager','manager','manager']
})
df

Unnamed: 0,month,name,gender,salary,bonus,职位
0,1,Adam,male,2000,1560,CEO
1,2,Adam,male,2000,1800,CEO
2,3,Amia,female,2000,1550,COO
3,4,Amia,female,2000,1700,COO
4,5,Amia,female,2000,1600,COO
5,2,Tim,male,1800,1300,director
6,3,Tim,female,1800,1300,director
7,4,Ike,male,1800,1100,director
8,5,Ike,male,1800,1330,director
9,1,Linda,female,1800,1100,manager


#### [DataFrame/Series.isin()](https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.isin.html)

On its own, this method only returns a DataFrame (or Series) of booleans showing whether each element is contained in values. However, coupled with [ ] indexing, that boolean result selects the desired rows.

In [2]:
# to filter rows whose salary equals 1800, first let's see what isin() returns.
# salary is an integer column, so pass the integer 1800 rather than the string '1800'.
df['salary'].isin([1800])

0     False
1     False
2     False
3     False
4     False
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
Name: salary, dtype: bool

In [3]:
# from here, we can select the matching rows using indexing.
df[df['salary'].isin([1800])]

Unnamed: 0,month,name,gender,salary,bonus,职位
5,2,Tim,male,1800,1300,director
6,3,Tim,female,1800,1300,director
7,4,Ike,male,1800,1100,director
8,5,Ike,male,1800,1330,director
9,1,Linda,female,1800,1100,manager
10,2,Linda,female,1800,1320,manager
11,3,Linda,female,1800,1370,manager
12,4,Linda,female,1800,1330,manager


In [4]:
# you can pass more than one value as the filtering criteria. For example, check who gets a bonus of 1100, 1300, or 1800.
df[df['bonus'].isin([1100,1300,1800])]

Unnamed: 0,month,name,gender,salary,bonus,职位
1,2,Adam,male,2000,1800,CEO
5,2,Tim,male,1800,1300,director
6,3,Tim,female,1800,1300,director
7,4,Ike,male,1800,1100,director
9,1,Linda,female,1800,1100,manager


In [5]:
# to get the inverse result, add ~ in front of the condition.
df[~df['bonus'].isin([1100,1300,1800])]

Unnamed: 0,month,name,gender,salary,bonus,职位
0,1,Adam,male,2000,1560,CEO
2,3,Amia,female,2000,1550,COO
3,4,Amia,female,2000,1700,COO
4,5,Amia,female,2000,1600,COO
8,5,Ike,male,1800,1330,director
10,2,Linda,female,1800,1320,manager
11,3,Linda,female,1800,1370,manager
12,4,Linda,female,1800,1330,manager


In [6]:
# pass a dict to check each column separately. the key should be the column name and the value the list of accepted values.
# let's try to find male staff whose salary is 2000.
df[df.isin({'gender':['male'],'salary':[2000]})]
# first, it returns a DataFrame of the same shape in which matching cells keep their value and all others become NaN.
# second, each cell is checked independently, so rows matching either condition show up; it is not an AND condition.
# so passing a dict is not recommended for this kind of row filtering.

Unnamed: 0,month,name,gender,salary,bonus,职位
0,,,male,2000.0,,
1,,,male,2000.0,,
2,,,,2000.0,,
3,,,,2000.0,,
4,,,,2000.0,,
5,,,male,,,
6,,,,,,
7,,,male,,,
8,,,male,,,
9,,,,,,
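
To actually get an AND across columns, one option is to combine one boolean Series per condition with `&`, or to reduce the dict-based `isin()` result row by row. The sketch below uses a shortened, illustrative version of the staff table (not the full data above).

```python
import pandas as pd

# A shortened, illustrative version of the staff table.
df = pd.DataFrame({
    'name': ['Adam', 'Amia', 'Tim', 'Ike', 'Linda'],
    'gender': ['male', 'female', 'male', 'male', 'female'],
    'salary': [2000, 2000, 1800, 1800, 1800],
})

# Option 1: build one boolean Series per condition and combine them with &.
both = df[(df['gender'] == 'male') & (df['salary'] == 2000)]

# Option 2: reuse the dict form of isin(), keep only the checked columns,
# and require all of them to match in the same row.
criteria = {'gender': ['male'], 'salary': [2000]}
row_mask = df.isin(criteria)[list(criteria)].all(axis=1)
both_from_dict = df[row_mask]

print(both)
```

Both variants keep only the rows where every condition holds, and the result is a normal subset of rows rather than a NaN-filled frame.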


#### Besides numeric values, string values are also common in data processing. The str accessor provides many powerful methods for dealing with string data, and str.contains() is one of them.

#### [str.contains()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)

It returns a boolean Series or Index based on whether a given pattern or regex is contained within each string of a Series or Index. The first parameter is therefore the pattern.

In [7]:
sales_data = pd.DataFrame(
    {"name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"],
     "region":["East","North","East","South","West","West","South","West","West","East","South"],
     "sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000],
     "expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

In [8]:
sales_data.str.contains('East')
# Note that it must be a Series rather than a DataFrame

AttributeError: 'DataFrame' object has no attribute 'str'
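
If the same pattern has to be checked in several string columns, one possible workaround is to apply `str.contains()` column by column and then combine the results per row. This is a minimal sketch on a shortened copy of the data; the pattern `'ma'` and the `text_cols` list are chosen only for illustration.

```python
import pandas as pd

# A shortened, illustrative copy of sales_data.
sales_data = pd.DataFrame({
    'name': ['William', 'Emma', 'Sofia', 'Markus'],
    'region': ['East', 'North', 'East', 'South'],
})

# .str is a Series accessor, so select a single column first ...
in_region = sales_data['region'].str.contains('East')

# ... or run the same check on several string columns and keep rows
# where at least one column matches.
text_cols = ['name', 'region']
in_any = sales_data[text_cols].apply(
    lambda col: col.str.contains('ma', case=False)
).any(axis=1)

print(sales_data[in_any])
```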

In [9]:
sales_data[sales_data['region'].str.contains('East')]

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
2,Sofia,East,90000,50000
9,Anika,East,65000,44000


In [10]:
# the pattern only needs to match part of the string: str.contains() looks for it anywhere in the value.
# note that regex=True by default, so the pattern is interpreted as a regular expression.
sales_data[sales_data['region'].str.contains('Ea')]

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
2,Sofia,East,90000,50000
9,Anika,East,65000,44000
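
Because the pattern is a regular expression by default, characters such as `.` or `+` have a special meaning. The sketch below, on a made-up Series of codes, shows how `regex=False` switches to literal matching and how `na=False` treats missing values as non-matches so the result can be used directly for filtering.

```python
import pandas as pd

# Hypothetical codes; None stands in for a missing entry.
codes = pd.Series(['A.1', 'AB1', 'A+1', None])

# Default regex=True: '.' matches any single character.
as_regex = codes.str.contains('A.1', na=False)

# regex=False: the pattern is treated as a literal substring.
as_literal = codes.str.contains('A.1', regex=False, na=False)

print(as_regex.tolist())    # [True, True, True, False]
print(as_literal.tolist())  # [True, False, False, False]
```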


In [11]:
# Specifying case sensitivity using case. 
sales_data[sales_data['region'].str.contains('ea',case=False)]

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
2,Sofia,East,90000,50000
9,Anika,East,65000,44000


In [12]:
# use | (regex alternation) to match either of two patterns.
sales_data[sales_data['name'].str.contains('w|m',case=False)]

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
1,Emma,North,52000,43000
3,Markus,South,34000,44000
4,Edward,West,42000,38000
5,Thomas,West,72000,39000


In [13]:
# select those whose name starts with 'A' (^ anchors the pattern to the start of the string)
sales_data[sales_data['name'].str.contains('^a',case=False)]

Unnamed: 0,name,region,sales,expenses
8,Arun,West,67000,39000
9,Anika,East,65000,44000


#### [DataFrame.where()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html) 
It replaces values where the condition is False.
#### [DataFrame.mask()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html)
It is the opposite of the where() method: it replaces values where the condition is True.

In [14]:
sales_data

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
1,Emma,North,52000,43000
2,Sofia,East,90000,50000
3,Markus,South,34000,44000
4,Edward,West,42000,38000
5,Thomas,West,72000,39000
6,Ethan,South,49000,42000
7,Olivia,West,55000,60000
8,Arun,West,67000,39000
9,Anika,East,65000,44000


In [15]:
sales_data.where(sales_data['region']=='West')

Unnamed: 0,name,region,sales,expenses
0,,,,
1,,,,
2,,,,
3,,,,
4,Edward,West,42000.0,38000.0
5,Thomas,West,72000.0,39000.0
6,,,,
7,Olivia,West,55000.0,60000.0
8,Arun,West,67000.0,39000.0
9,,,,


In [16]:
# as you can see, entries where cond is False are replaced with NaN.
# However, you can specify the replacement value with the other parameter.
sales_data.where(sales_data['region']=='West',other='good')

Unnamed: 0,name,region,sales,expenses
0,good,good,good,good
1,good,good,good,good
2,good,good,good,good
3,good,good,good,good
4,Edward,West,42000,38000
5,Thomas,West,72000,39000
6,good,good,good,good
7,Olivia,West,55000,60000
8,Arun,West,67000,39000
9,good,good,good,good


In [17]:
sales_data.mask(sales_data['region']=='West',other='bad')
# mask() is the reverse of where(): where cond is False, the original value is kept.
# where cond is True, the value is replaced with what other specifies.

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
1,Emma,North,52000,43000
2,Sofia,East,90000,50000
3,Markus,South,34000,44000
4,bad,bad,bad,bad
5,bad,bad,bad,bad
6,Ethan,South,49000,42000
7,bad,bad,bad,bad
8,bad,bad,bad,bad
9,Anika,East,65000,44000
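
The same pair of methods also exists on a Series, which makes the symmetry easy to see. As a small sketch on made-up sales figures, capping values at 60000 can be written either way:

```python
import pandas as pd

# Illustrative sales figures.
sales = pd.Series([50000, 90000, 34000, 72000], name='sales')

# where(): keep values that satisfy the condition, replace the rest.
capped = sales.where(sales <= 60000, other=60000)

# mask(): replace values that satisfy the condition, keep the rest.
same_capped = sales.mask(sales > 60000, other=60000)

print(capped.tolist())  # [50000, 60000, 34000, 60000]
```

Negating the condition turns one call into the other, so the choice is mostly about which condition reads more naturally.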


In [18]:
sales_data

Unnamed: 0,name,region,sales,expenses
0,William,East,50000,42000
1,Emma,North,52000,43000
2,Sofia,East,90000,50000
3,Markus,South,34000,44000
4,Edward,West,42000,38000
5,Thomas,West,72000,39000
6,Ethan,South,49000,42000
7,Olivia,West,55000,60000
8,Arun,West,67000,39000
9,Anika,East,65000,44000


In [19]:
# where()/mask() might seem less practical for filtering, but I use them to label values that meet certain criteria.
# For example, label rows whose sales are greater than 50000 and expenses greater than 30000 as 'A', and all other rows as 'B'.
cond1 = sales_data['sales'] > 50000
cond2 = sales_data['expenses'] > 30000
sales_data['new_label'] = ''
sales_data['new_label'] = sales_data['new_label'].mask(cond1 & cond2,other='A')
sales_data['new_label'] = sales_data['new_label'].where(cond1 & cond2,other='B')
sales_data

Unnamed: 0,name,region,sales,expenses,new_label
0,William,East,50000,42000,B
1,Emma,North,52000,43000,A
2,Sofia,East,90000,50000,A
3,Markus,South,34000,44000,B
4,Edward,West,42000,38000,B
5,Thomas,West,72000,39000,A
6,Ethan,South,49000,42000,B
7,Olivia,West,55000,60000,A
8,Arun,West,67000,39000,A
9,Anika,East,65000,44000,A
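
For a two-way label like this one, `numpy.where` is a common alternative that does the same job in a single call. The sketch below uses a shortened copy of the data; the column names `label_mask` and `label_np` are made up for the comparison.

```python
import numpy as np
import pandas as pd

# A shortened, illustrative copy of sales_data.
sales_data = pd.DataFrame({
    'sales': [50000, 52000, 34000, 55000],
    'expenses': [42000, 43000, 44000, 60000],
})
cond = (sales_data['sales'] > 50000) & (sales_data['expenses'] > 30000)

# mask(): start from a column of 'B' and overwrite with 'A' where cond holds.
sales_data['label_mask'] = pd.Series('B', index=sales_data.index).mask(cond, 'A')

# numpy.where: pick 'A' where cond is True and 'B' elsewhere.
sales_data['label_np'] = np.where(cond, 'A', 'B')

print(sales_data)
```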


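#### [DataFrame.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)
It selects rows of a DataFrame with a boolean expression written as a string.

In [20]:
# subset a pandas dataframe based on a numeric variable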
sales_data.query('sales > 60000')

Unnamed: 0,name,region,sales,expenses,new_label
2,Sofia,East,90000,50000,A
5,Thomas,West,72000,39000,A
8,Arun,West,67000,39000,A
9,Anika,East,65000,44000,A
10,Paulo,South,67000,45000,A


In [21]:
# You can refer to variables in the environment by prefixing them with an ‘@’ character
num = 60000
sales_data.query('sales > @num')

Unnamed: 0,name,region,sales,expenses,new_label
2,Sofia,East,90000,50000,A
5,Thomas,West,72000,39000,A
8,Arun,West,67000,39000,A
9,Anika,East,65000,44000,A
10,Paulo,South,67000,45000,A
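
The `@` prefix is not limited to scalars. As a sketch on a shortened copy of the data, a Python list can be referenced the same way and combined with `in`; the variable names `regions` and `threshold` are only illustrative.

```python
import pandas as pd

# A shortened, illustrative copy of sales_data.
sales_data = pd.DataFrame({
    'name': ['William', 'Sofia', 'Markus', 'Thomas'],
    'region': ['East', 'East', 'South', 'West'],
    'sales': [50000, 90000, 34000, 72000],
})

regions = ['East', 'West']
threshold = 60000

# Both local variables are looked up through the @ prefix.
picked = sales_data.query('sales > @threshold and region in @regions')

print(picked)
```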


In [22]:
# select rows based on a categorical variable
sales_data.query('region == "East"')

Unnamed: 0,name,region,sales,expenses,new_label
0,William,East,50000,42000,B
2,Sofia,East,90000,50000,A
9,Anika,East,65000,44000,A


In [23]:
# subset a dataframe by index
sales_data.query('index < 3')

Unnamed: 0,name,region,sales,expenses,new_label
0,William,East,50000,42000,B
1,Emma,North,52000,43000,A
2,Sofia,East,90000,50000,A


In [24]:
sales_data['index'] = [i+3 for i in range(sales_data.shape[0])]
sales_data

Unnamed: 0,name,region,sales,expenses,new_label,index
0,William,East,50000,42000,B,3
1,Emma,North,52000,43000,A,4
2,Sofia,East,90000,50000,A,5
3,Markus,South,34000,44000,B,6
4,Edward,West,42000,38000,B,7
5,Thomas,West,72000,39000,A,8
6,Ethan,South,49000,42000,B,9
7,Olivia,West,55000,60000,A,10
8,Arun,West,67000,39000,A,11
9,Anika,East,65000,44000,A,12


In [25]:
# My understanding is that in query(), 'index' refers to the DataFrame's index by default.
# so I made another column named 'index' to see which one gets filtered.
sales_data.query('index < 6')
# it turns out the column I made is selected: a column name takes precedence over the index.

Unnamed: 0,name,region,sales,expenses,new_label,index
0,William,East,50000,42000,B,3
1,Emma,North,52000,43000,A,4
2,Sofia,East,90000,50000,A,5
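
When a column named `index` shadows the real index like this, the index can still be reached. As far as I know, the pandas user guide mentions the special identifier `ilevel_0` for an unnamed index; plain boolean indexing with `df.index` is the more explicit fallback. A minimal sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['William', 'Emma', 'Sofia', 'Markus'],
    'index': [3, 4, 5, 6],  # a regular column that happens to be called 'index'
})

# 'index' resolves to the column, because columns take precedence.
by_column = df.query('index < 5')

# 'ilevel_0' refers to the first (unnamed) level of the real index.
by_real_index = df.query('ilevel_0 < 1')

# The same selection without query().
by_loc = df.loc[df.index < 1]

print(by_column)
print(by_real_index)
```

Renaming the column is usually the cleaner fix, since `ilevel_0` is easy to misread.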


In [26]:
sales_data.drop('index',axis=1,inplace=True)
# use the modulo operator (%) on the index to retrieve the odd-numbered rows of our DataFrame:
sales_data.query('index%2 == 1')

Unnamed: 0,name,region,sales,expenses,new_label
1,Emma,North,52000,43000,A
3,Markus,South,34000,44000,B
5,Thomas,West,72000,39000,A
7,Olivia,West,55000,60000,A
9,Anika,East,65000,44000,A


In [27]:
# subset a pandas dataframe by comparing two columns
sales_data.query('sales < expenses')

Unnamed: 0,name,region,sales,expenses,new_label
3,Markus,South,34000,44000,B
7,Olivia,West,55000,60000,A


In [28]:
# subset a pandas dataframe with multiple conditions
sales_data.query('(sales > 50000) and (region in ["East", "West"])')

Unnamed: 0,name,region,sales,expenses,new_label
2,Sofia,East,90000,50000,A
5,Thomas,West,72000,39000,A
7,Olivia,West,55000,60000,A
8,Arun,West,67000,39000,A
9,Anika,East,65000,44000,A
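
For comparison, the same multi-condition filter can be written with bracket indexing and `isin()` from the first section. This sketch uses a shortened copy of the data and checks that both forms select the same rows.

```python
import pandas as pd

# A shortened, illustrative copy of sales_data.
sales_data = pd.DataFrame({
    'name': ['William', 'Emma', 'Sofia', 'Markus', 'Olivia'],
    'region': ['East', 'North', 'East', 'South', 'West'],
    'sales': [50000, 52000, 90000, 34000, 55000],
})

# query(): the whole condition is one string.
via_query = sales_data.query('(sales > 50000) and (region in ["East", "West"])')

# Bracket indexing: one boolean Series per condition, combined with &.
via_mask = sales_data[
    (sales_data['sales'] > 50000)
    & sales_data['region'].isin(['East', 'West'])
]

print(via_query.equals(via_mask))  # True
```

Which form to use is largely a matter of readability: query() keeps long conditions compact, while boolean masks can be stored and reused.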
