# PySpark Tutorial 3
## PySpark DataFrame

 - Filter Operation
 - Combining conditions with &, |, ==
 - Negating a condition with ~

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe").getOrCreate()

In [3]:
# Read the data
df = spark.read.csv("train.csv", header=True, inferSchema=True)
df.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

## Filter Operation

In [4]:
# Passengers whose age is less than or equal to 30
df.filter("Age<=30").show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|      Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25|       null|       S|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925|       null|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075|       null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333|       null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708|       null|       C|
|         11|       1|     3|San

As you can see, every value in the Age column is now less than or equal to 30.

In [5]:
df.filter("Age<=30").select("Sex", "Survived").show()

+------+--------+
|   Sex|Survived|
+------+--------+
|  male|       0|
|female|       1|
|  male|       0|
|female|       1|
|female|       1|
|female|       1|
|  male|       0|
|female|       0|
|  male|       0|
|female|       1|
|  male|       1|
|female|       0|
|  male|       0|
|  male|       0|
|  male|       0|
|female|       0|
|female|       1|
|female|       0|
|female|       1|
|female|       1|
+------+--------+
only showing top 20 rows



In [6]:
# Another way to filter: pass a Column expression instead of a SQL string
df.filter(df["Age"] <= 30).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|      Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25|       null|       S|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925|       null|       S|
|          8|       0|     3|Palsson, Master. ...|  male| 2.0|    3|    1|          349909| 21.075|       null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333|       null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708|       null|       C|
|         11|       1|     3|San

In [8]:
# Combine multiple conditions with & (AND); each condition needs its own parentheses
df.filter((df["Age"] <= 30) &
          (df["Survived"] == 1)).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|      Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----------+--------+
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925|       null|       S|
|          9|       1|     3|Johnson, Mrs. Osc...|female|27.0|    0|    2|          347742|11.1333|       null|       S|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|          237736|30.0708|       null|       C|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1|         PP 9549|   16.7|         G6|       S|
|         23|       1|     3|"McGowan, Miss. A...|female|15.0|    0|    0|          330923| 8.0292|       null|       Q|
|         24|       1|     1|Slo

This shows only the passengers who are aged 30 or younger and who also survived.

In [10]:
# NOT operation: ~ negates the condition inside the parentheses
df.filter(~(df["Age"] <= 30)).show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|    Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|  PC 17599|71.2833|  C85|       C|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|    113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|    373450|   8.05| null|       S|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|     17463|51.8625|  E46|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|    113783|  26.55| C103|       S|
|         14|       0|     3|Andersson, Mr. An...|  male|39.0|    1|    5|    347082| 31.275| null|       S|
|         16|      

With the NOT operator (~), the filter is inverted: it shows the rows where Age is not less than or equal to 30, that is, passengers older than 30.