# Tutorial 4

**This tutorial will cover:**

* Using filter operations to retrieve rows that match a condition:
    * `&` (and)
    * `|` (or)
    * `==` (equal to)
    * `<` (less than)
    * `>` (greater than)
    * `<=` (less than or equal to)
    * `>=` (greater than or equal to)
    * `~` (not)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Practice").getOrCreate()

df1 = spark.read.csv("test-data-4.csv", header=True, inferSchema=True)
df1.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



## Filter operations

In [2]:
# Use the `filter()` method to retrieve the people who have salaries less than or equal to 20000.
df1.filter("Salary <= 20000").show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
|  Sunny| 29|         4| 20000|
|   Paul| 24|         3| 20000|
| Harsha| 21|         1| 15000|
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [3]:
# Equivalent syntax, using a Column expression instead of a SQL string:
df1.filter(df1["Salary"] <= 20000).show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
|  Sunny| 29|         4| 20000|
|   Paul| 24|         3| 20000|
| Harsha| 21|         1| 15000|
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [4]:
# Chain `select()` after `filter()` to return only specific columns.
# Column names are case-insensitive by default, so "Age" matches the `age` column.
df1.filter("Salary <= 20000").select(["Name", "Age"]).show()

+-------+---+
|   Name|Age|
+-------+---+
|  Sunny| 29|
|   Paul| 24|
| Harsha| 21|
|Shubham| 23|
+-------+---+



In [5]:
# Use the and (&) operator to require that every condition holds.
# Wrap each condition in parentheses: in Python, & binds more tightly than < and >.
df1.filter(
    (df1["Salary"] < 20000) & 
    (df1["Salary"] > 15000)
).show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [6]:
# Use the or (|) operator to require that at least one condition holds.
# Wrap each condition in parentheses: in Python, | binds more tightly than < and >.
df1.filter(
    (df1["Salary"] > 25000) | 
    (df1["Salary"] < 20000)
).show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
|  Krish| 31|        10| 30000|
| Harsha| 21|         1| 15000|
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [7]:
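# Use the not (~) operator to negate a condition.
df1.filter(~(df1["Salary"] <= 20000)).show()

# NOTE: The not (~) operator works only on Column expressions. Applying it to a
# SQL string raises a TypeError, because Python strings do not support ~:
# df1.filter(~("Salary <= 20000")).show()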

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
+---------+---+----------+------+

