In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['Division', 'Qualification', 'Gender', 'Channel_of_Recruitment', 
                'Marital_Status', 'Foreign_schooled']

For the purpose of this tutorial we use only the columns listed above. Feel free to explore the entire dataset.

In [3]:
data = pd.read_csv('train.csv', usecols = columns)

data.head(5)

Unnamed: 0,Division,Qualification,Gender,Channel_of_Recruitment,Foreign_schooled,Marital_Status
0,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,No,Married
1,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,Yes,Married
2,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,Yes,Married
3,Commercial Sales and Marketing,First Degree or HND,Male,Agency and others,Yes,Single
4,Information and Strategy,First Degree or HND,Male,Direct Internal process,Yes,Married


In [None]:
data.columns

In [None]:
def num_unique(df, var):
    """Return the array of distinct values in column `var` of `df`."""
    return df[var].unique()

### Pandas isin Syntax

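**`DataFrame.isin(values)`**

**Parameter**: `values` can be an iterable (such as a list), a Series, a DataFrame or a dictionary. With an iterable, every cell is checked against the given values. With a dictionary, the keys are column names and each column is checked against its own list of values.

**Returns**: A DataFrame of booleans showing whether each element in the DataFrame is contained in `values`. This boolean result can then be used as a mask to filter rows. `Series.isin(values)` works the same way on a single column and returns a boolean Series.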
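
As a minimal sketch of the two most common forms of `values`, the example below uses a small hypothetical frame (`toy` is made up for illustration and is not part of `train.csv`). A list is compared against every cell, while a dictionary restricts the check to the named columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male'],
    'Marital_Status': ['Married', 'Single', 'Married'],
})

# A list is compared against every cell, regardless of column
mask_list = toy.isin(['Male', 'Married'])

# A dict maps a column name to its own allowed values;
# columns that are not listed in the dict come back as all False
mask_dict = toy.isin({'Gender': ['Male']})

print(mask_list)
print(mask_dict)
```

Both results have the same shape as `toy`, which is why they can be reduced with `.all(axis=1)` or `.any(axis=1)` and used to select rows.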

In [None]:
for var in data.columns:
    print(num_unique(data, var))

### Filtering a Single Column with Pandas Isin

1. Filter dataframe to get only those with `MSc, MBA and PhD` qualification
2. Filter dataframe to get only those employed through `Referral and Special candidates`
3. Filter dataframe to get only those in `Customer Support and Field Operations` division

#### 1. Filter dataframe to get only those with `MSc, MBA and PhD` qualification

In [None]:
data_1 = data[data['Qualification'].isin(['MSc, MBA and PhD'])]

data_1

#### 2. Filter dataframe to get only those employed through `Referral and Special candidates`

In [None]:
data_2 = data[data['Channel_of_Recruitment'].isin(['Referral and Special candidates'])]

data_2


#### 3. Filter dataframe to get only those in `Customer Support and Field Operations` division

In [None]:
data_3 = data[data['Division'].isin(['Customer Support and Field Operations'])]

data_3
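
With a single value, `isin` gives the same rows as a plain `==` comparison. Its advantage shows up once several values are allowed, where it replaces a chain of `|` conditions. The sketch below illustrates this on a hypothetical frame (`toy` and `wanted` are assumptions for the example, not objects from this notebook).

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Division': [
        'Commercial Sales and Marketing',
        'Information and Strategy',
        'Customer Support and Field Operations',
    ]
})

# One allowed value: isin and == select the same rows
single_isin = toy[toy['Division'].isin(['Information and Strategy'])]
single_eq = toy[toy['Division'] == 'Information and Strategy']

# Several allowed values: isin replaces a chain of OR conditions
wanted = ['Information and Strategy', 'Customer Support and Field Operations']
multi_isin = toy[toy['Division'].isin(wanted)]
multi_or = toy[(toy['Division'] == wanted[0]) | (toy['Division'] == wanted[1])]

print(multi_isin)
```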

### Filtering Multiple Columns with Pandas Isin

In this section we will look at the following use cases for filtering multiple columns with the pandas `isin` method

1. filter dataframe to include `Female` in `Information Technology and Solution Support` division with `MSc, MBA and PhD` Qualification
2. filter dataframe to include `Male` employed through `Direct Internal process` with `First Degree or HND` Qualification
3. filter dataframe to include those in `Customer Support and Field Operations` with `Non-University Education` and employed through `Direct Internal process` and are `Married`

#### 1. filter dataframe to include `Female` in `Information Technology and Solution Support` division with `MSc, MBA and PhD` Qualification

In [None]:
data_4 = data[data[['Gender', 'Division', 'Qualification']].isin(
    ['Female', 'Information Technology and Solution Support', 'MSc, MBA and PhD']
).all(axis=1)]

In [None]:
data_4.head(6)
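
One caveat with passing a list to a multi-column `isin`: each value is matched against every selected column, not only the column it was intended for. That is harmless when the columns share no values, as in the filter above, but it can over-match otherwise. The sketch below shows the difference on a hypothetical frame (`toy` and its `Has_Mentor` column are made up for illustration) and uses the dictionary form of `isin` to pin each value to its column.

```python
import pandas as pd

# Hypothetical toy data: both columns share the values 'Yes' and 'No'
toy = pd.DataFrame({
    'Foreign_schooled': ['Yes', 'No', 'Yes'],
    'Has_Mentor': ['No', 'No', 'Yes'],  # hypothetical column
})

# Intended filter: Foreign_schooled == 'Yes' AND Has_Mentor == 'No'

# List form: 'Yes' and 'No' are accepted in either column, so every row passes
list_mask = toy.isin(['Yes', 'No']).all(axis=1)

# Dict form: each column is checked against its own allowed values
dict_mask = toy.isin({
    'Foreign_schooled': ['Yes'],
    'Has_Mentor': ['No'],
}).all(axis=1)

print(toy[list_mask])
print(toy[dict_mask])
```

When using the dictionary form together with `.all(axis=1)`, the frame should contain only the columns named in the dictionary, since any other column comes back as all `False`.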

#### 2. filter dataframe to include `Male` employed through `Direct Internal process` with `First Degree or HND` Qualification

In [None]:
data_5 = data[data[['Gender', 'Channel_of_Recruitment', 'Qualification']].isin(
    ['Male', 'Direct Internal process', 'First Degree or HND']
).all(axis=1)]

print(data_5.shape)
data_5.head(6)

#### 3. filter dataframe to include those in `Customer Support and Field Operations` with `Non-University Education` and employed through `Direct Internal process` and are `Married`

In [None]:
data_6 = data[data[['Division', 'Qualification', 'Channel_of_Recruitment', 'Marital_Status']].isin(
    ['Customer Support and Field Operations', 'Non-University Education', 'Direct Internal process', 'Married']
).all(axis=1)]

print(data_6.shape)
data_6.tail(4)

Similarly, we can keep the rows where any one of the conditions above is met by chaining the `.any` method instead of `.all`. The code below filters the dataframe to include those in `Customer Support and Field Operations` or with `Non-University Education` or employed through `Direct Internal process` or who are `Married`

In [None]:
data_7 = data[data[['Division', 'Qualification', 'Channel_of_Recruitment', 'Marital_Status']].isin(
    ['Customer Support and Field Operations', 'Non-University Education', 'Direct Internal process', 'Married']
).any(axis=1)]


data_7.head(3)
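
The difference between `.all(axis=1)` and `.any(axis=1)` can be sketched on a small hypothetical frame (`toy` is made up for illustration): `all` keeps a row only when every selected column matches, while `any` keeps it as soon as one column matches.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Qualification': [
        'Non-University Education',
        'First Degree or HND',
        'Non-University Education',
        'First Degree or HND',
    ],
    'Marital_Status': ['Married', 'Married', 'Single', 'Single'],
})

# Cell-wise membership test against the allowed values
mask = toy.isin(['Non-University Education', 'Married'])

# AND across columns: both columns must match
rows_all = toy[mask.all(axis=1)]

# OR across columns: at least one column must match
rows_any = toy[mask.any(axis=1)]

print(rows_all)
print(rows_any)
```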

The code below selects the employees who have either an `MSc, MBA and PhD` **or** a `Non-University Education` qualification. Since neither value is a marital status, only the `Qualification` column can produce a match here.

In [None]:
data_8 = data[data[['Qualification', 'Marital_Status']].isin(
    ['MSc, MBA and PhD', 'Non-University Education']
).any(axis=1)]


data_8.head(3)

In [None]:
data_8

### Filtering Dataframe using Pandas Isin **Not** Matching Condition

We can use the `~` operator (bitwise NOT), which inverts a boolean mask, to perform a `NOT IN` selection

You can learn more about Python's unary operators [here](https://orclqa.com/python-unary-operator/#:~:text=A%20unary%20operator%20is%20an,preceded%20by%20the%20unary%20operator.)

In [None]:
# select employees that are not in Commercial Sales and Marketing or Research and Innovation
# that is, select all employees except those in Research and Innovation or Commercial Sales and Marketing division
data_9 = data[~data['Division'].isin(['Commercial Sales and Marketing', 'Research and Innovation'])]

data_9

In [None]:
# get all employees except those in Customer Support and Field Operations
# and except those without foreign schooling (Foreign_schooled == 'No');
# a row is dropped if either condition matches

data_10 = data[~data[['Division', 'Foreign_schooled']].isin(['Customer Support and Field Operations', 'No']).any(axis=1)]

data_10

In [None]:
data_10['Division'].unique()
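
Negating `.any(axis=1)` keeps only the rows where none of the selected columns match. A minimal sketch on a hypothetical frame (`toy` is made up for illustration) shows that this is the same as inverting the cell-wise mask first and then requiring `.all(axis=1)`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Division': [
        'Customer Support and Field Operations',
        'Information and Strategy',
        'Information and Strategy',
    ],
    'Foreign_schooled': ['Yes', 'No', 'Yes'],
})

# Cell-wise membership test against the values to exclude
mask = toy.isin(['Customer Support and Field Operations', 'No'])

# NOT (match in any column): drop a row if either column matches
not_any = ~mask.any(axis=1)

# Equivalent form: every column must be a non-match
all_not = (~mask).all(axis=1)

kept = toy[not_any]
print(kept)
```

Note that `~` is applied after the method calls to its right have run, so `~mask.any(axis=1)` negates the result of `.any(axis=1)`, not `mask` itself.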

In [None]:
col_missing = [
    var for var in data.columns if data[var].isnull().sum() > 0
]

col_missing

In [None]:
print(data['Qualification'].isnull().sum())
print(data['Qualification'].unique())