## Introduction

In this article we will learn how to perform filtering operations in PySpark. We will cover two types of filter operations:

1. **Relational filters:** These operations are based on the familiar relational operators, such as equal to, less than or equal to, or greater than or equal to **(=, <=, >=)**.

2. **Logical filters:** These operations are needed when we are dealing with multiple conditions. Here we will discuss all three main logical operators: **AND(&), OR(|), NOT(~)**.

If you are new to PySpark, I suggest following my PySpark series:

1. Getting started with PySpark using Python
2. Data Preprocessing using PySpark - PySpark's DataFrame
3. Data preprocessing using PySpark - Handling missing values
4. Data Preprocessing using PySpark - Aggregate and GroupBy functions
5. Introduction to PySpark MLIB

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=7232292e1bc9b0af9951eed9866e3ee9466238ff02ef659794df85b09463ecfa
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Mandatory Steps
If you are already following my PySpark series, you will know the steps I'm going to perform now. Let's go through them in a nutshell:

1. First, we **import SparkSession** to start the PySpark session.
2. Then, with the help of the **getOrCreate()** function, we create our Apache Spark session.
3. Finally, we look at what our **spark object** holds, which the notebook displays in a graphical format.

**Note:** I have discussed these steps in detail in my first article, [Getting Started with PySpark Using Python](https://www.analyticsvidhya.com/blog/2022/04/getting-started-with-pyspark-using-python/)

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('filter_operations').getOrCreate()

In [4]:
spark

## Reading the dataset

In this section we will read our dummy dataset and **store it as a DataFrame instance**, with **header** and **inferSchema** set to True, so that the first row is used for the column names and the column types are inferred from the data.

In [5]:
df_filter_pyspark = spark.read.csv('/content/part2.2.csv', header = True, inferSchema=True)
df_filter_pyspark.show()

+-------+------+-------------+---------+
|EmpName|EmpAge|EmpExperience|EmpSalary|
+-------+------+-------------+---------+
| Oliver|    31|           10|    30000|
|  Harry|    30|            8|    25000|
| George|    29|            4|    20000|
|   Jack|    24|            3|    20000|
|  Jacob|    21|            1|    15000|
|    Leo|    23|            2|    18000|
|  Oscar|  null|         null|    40000|
|   null|    34|           10|    38000|
|   null|    36|         null|     null|
+-------+------+-------------+---------+



## Relational filtering

Here comes the hands-on part. For relational filtering we can use different operators such as **less than, less than or equal to, greater than, greater than or equal to, and equal to.**

In [6]:
df_filter_pyspark.filter("EmpSalary<=25000").show()

+-------+------+-------------+---------+
|EmpName|EmpAge|EmpExperience|EmpSalary|
+-------+------+-------------+---------+
|  Harry|    30|            8|    25000|
| George|    29|            4|    20000|
|   Jack|    24|            3|    20000|
|  Jacob|    21|            1|    15000|
|    Leo|    23|            2|    18000|
+-------+------+-------------+---------+



**Inference:** Here we can see that only the records where the employee's salary is less than or equal to 25000 are kept.

**Selecting the relevant columns instead of showing all the columns**

Selecting only the columns we need can save execution time. When working with a large dataset, retrieving every column takes longer, so if we know which columns we want to see, we can choose just those, as shown below.

In [7]:
df_filter_pyspark.filter("EmpSalary<=25000").select(['EmpName','EmpAge']).show()

+-------+------+
|EmpName|EmpAge|
+-------+------+
|  Harry|    30|
| George|    29|
|   Jack|    24|
|  Jacob|    21|
|    Leo|    23|
+-------+------+



**Code breakdown:**

The above code can be broken down into three simple steps:

1. First we filter the employees based on salary, i.e. we keep the rows where the employee's salary is less than or equal to 25000.
2. Then we select the two columns "EmpName" and "EmpAge" using the select function.
3. Finally we display the filtered DataFrame using the show function.

**Note:** We can use the other relational operators in the same way, according to the problem statement. We just need to replace the operator and we are good to go.

**Another approach of selecting the columns**

Here we will look at one more way to select our desired columns and get exactly the same result as in the previous output.

**Tip:** This line of code may remind you of how rows are filtered in **Pandas**.

In [8]:
df_filter_pyspark.filter(df_filter_pyspark['EmpSalary']<=25000).select(['EmpName','EmpAge']).show()

+-------+------+
|EmpName|EmpAge|
+-------+------+
|  Harry|    30|
| George|    29|
|   Jack|    24|
|  Jacob|    21|
|    Leo|    23|
+-------+------+



**Inference:** In the output we can clearly see that we got exactly the same result as in the previous filter operation. The only change is the way we selected the records based on salary: **df_filter_pyspark['EmpSalary']<=25000**. Here we first took the DataFrame object, indexed it with the column name, and then added the filter condition, just like we would do in **Pandas**.

## Logical filtering 

In this section we will filter the records based on multiple conditions, and for that we will look at three different cases:

1. AND condition
2. OR condition
3. NOT condition

**"AND" condition:** Anyone familiar with SQL, or with any programming language used for data manipulation, will know that with the AND operation all the conditions need to be **TRUE**, i.e. if any one of the conditions is **false** for a row, that row will not be shown in the output.

**Note:** In PySpark we use **"&"** symbol to denote the **AND** operation.

In [9]:
df_filter_pyspark.filter((df_filter_pyspark['EmpSalary']<=30000)
                          & (df_filter_pyspark['EmpSalary']>=18000)).show()

+-------+------+-------------+---------+
|EmpName|EmpAge|EmpExperience|EmpSalary|
+-------+------+-------------+---------+
| Oliver|    31|           10|    30000|
|  Harry|    30|            8|    25000|
| George|    29|            4|    20000|
|   Jack|    24|            3|    20000|
|    Leo|    23|            2|    18000|
+-------+------+-------------+---------+



**Code breakdown:**
Here we used two conditions: the salary of the employee must be less than or equal to 30000 & (AND) greater than or equal to 18000, i.e. only the records that fall within this range are shown in the results, and the other records are skipped.

1. Condition 1: **df_filter_pyspark['EmpSalary']<=30000** keeps the salaries that are less than or equal to 30000.
2. Condition 2: **df_filter_pyspark['EmpSalary']>=18000** keeps the salaries that are greater than or equal to 18000.
3. Then we used the **"&"** operation to combine both conditions, and finally the **show() function** to display the results.

**"OR" condition:** This condition is used when we don't want the filtering to be too strict, i.e. when we want to keep a record if any one of the conditions is True, **unlike the AND condition**, where all the conditions need to be True. So use the OR condition only when it is acceptable for just one of the conditions to hold.

**Note:** In PySpark we use "|" symbol to denote the **OR operation**.

In [10]:
df_filter_pyspark.filter((df_filter_pyspark['EmpSalary']<=30000)
                          | (df_filter_pyspark['EmpExperience']>=3)).show()

+-------+------+-------------+---------+
|EmpName|EmpAge|EmpExperience|EmpSalary|
+-------+------+-------------+---------+
| Oliver|    31|           10|    30000|
|  Harry|    30|            8|    25000|
| George|    29|            4|    20000|
|   Jack|    24|            3|    20000|
|  Jacob|    21|            1|    15000|
|    Leo|    23|            2|    18000|
|   null|    34|           10|    38000|
+-------+------+-------------+---------+



**Code breakdown:** Comparing the results of AND and OR shows the difference between the two and helps in choosing the right one for a given problem statement. Let's look at how we have used the OR operation:

1. Condition 1: **df_filter_pyspark['EmpSalary']<=30000** picks out the employees whose salary is less than or equal to 30000.
2. Condition 2: **df_filter_pyspark['EmpExperience']>=3** picks out the records where the employee's experience is greater than or equal to 3 years.
3. To combine both conditions we used the **"|" (OR)** operation, and at the end we used the show function to display the result as a DataFrame.

**"NOT" condition:** This condition **inverts a given condition**, i.e. it keeps every record for which the condition we specified does not hold. Put simply, **a row passes the NOT filter only if the original condition is False for it**.

**Note:** In PySpark we use "~" symbol to denote the **NOT operation**.

In [11]:
df_filter_pyspark.filter(~(df_filter_pyspark['EmpAge']>=30)).show()

+-------+------+-------------+---------+
|EmpName|EmpAge|EmpExperience|EmpSalary|
+-------+------+-------------+---------+
| George|    29|            4|    20000|
|   Jack|    24|            3|    20000|
|  Jacob|    21|            1|    15000|
|    Leo|    23|            2|    18000|
+-------+------+-------------+---------+



**Inference:** Here we can see that the employees whose age is greater than or equal to **30 do not appear in the list of records**, so it is clear that the **NOT operation** keeps a row only when the condition is False.

**Note:** The NOT ("~") operator applies to a single condition. To negate multiple conditions, first combine them with another **logical operator like "AND"/"OR"** inside parentheses and then apply "~" to the result.

## Key takeaways from this article

In this section we will summarize everything we did in this article: we started by setting up the environment for PySpark and then moved on to performing both relational and logical filtering on our dummy dataset.

1. First, we completed the mandatory steps of setting up the Spark session and reading the dataset, as these are the foundation for any further analysis.
2. Then we learned about relational filtering with hands-on operations on a PySpark DataFrame. We worked through a single operator and saw how the other operators can be applied using the same approach.
3. Finally, we moved on to the second type of filtering, i.e. logical filtering, where we discussed all three of its types: the AND, OR and NOT conditions.