<img src="https://drive.google.com/uc?id=1MTwHayokmQgl6cdICl4OHHYSWUQnQzOO" alt="drawing" style="width:500px;"/>

In [0]:
####### Imports ######
from pyspark.sql.functions import *
from pyspark.sql.types import *

####### Reading Data #######
employee_df  = spark.read.format("csv")\
                           .option("header","true")\
                           .option("inferSchema","true")\
                           .option("mode","PERMISSIVE")\
                           .load('/FileStore/tables/employee_details.csv')
####### Printing Schema #######
employee_df.printSchema()

####### Displaying the dataframe #######
employee_df.show(truncate=False)

####### Creating Temporary View #######
employee_df.createOrReplaceTempView("employee_tbl")


root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)

+---+--------+---+------+------------+--------+
|id |name    |age|salary|address     |nominee |
+---+--------+---+------+------------+--------+
|1  |Soumya  |23 |15000 |Odisha      |nominee1|
|2  |Jyotsna |23 |19000 |Mumbai      |nominee2|
|3  |Pratisha|17 |20000 |Kolkata     |India   |
|4  |Pritam  |22 |100000|Uttarpradesh|India   |
|5  |Vikash  |31 |30000 |null        |nominee5|
+---+--------+---+------+------------+--------+



In [0]:
# Column aliasing/renaming using the col() function and the alias() method
employee_df.select(col("id").alias("employee_id"),"name","age").show()

+-----------+--------+---+
|employee_id|    name|age|
+-----------+--------+---+
|          1|  Soumya| 23|
|          2| Jyotsna| 23|
|          3|Pratisha| 17|
|          4|  Pritam| 22|
|          5|  Vikash| 31|
+-----------+--------+---+



In [0]:
# Find the employees whose salary is more than 20,000.
# filter() and where() are equivalent: where() is an alias for filter(), so there is
# no difference in behavior between them.
# The choice is a matter of preference: where() reads naturally if you are used to SQL,
# while filter() may feel more familiar if you come from Scala or functional-style code.
employee_df.filter(col("salary")>20000).show()

+---+------+---+------+------------+--------+
| id|  name|age|salary|     address| nominee|
+---+------+---+------+------------+--------+
|  4|Pritam| 22|100000|Uttarpradesh|   India|
|  5|Vikash| 31| 30000|        null|nominee5|
+---+------+---+------+------------+--------+



In [0]:
# We can also do the same using where
employee_df.where(col("salary")>20000).show()

+---+------+---+------+------------+--------+
| id|  name|age|salary|     address| nominee|
+---+------+---+------+------------+--------+
|  4|Pritam| 22|100000|Uttarpradesh|   India|
|  5|Vikash| 31| 30000|        null|nominee5|
+---+------+---+------+------------+--------+



Multiple conditions can be combined with the logical operators & (and), | (or) and ~ (not). Each condition must be wrapped in parentheses.

In [0]:
# Find the employees who earn more than 20000 and are younger than 30
employee_df.where((col("salary")>20000) & (col("age") < 30)).show()

+---+------+---+------+------------+-------+
| id|  name|age|salary|     address|nominee|
+---+------+---+------+------------+-------+
|  4|Pritam| 22|100000|Uttarpradesh|  India|
+---+------+---+------+------------+-------+



##### literals
A literal (also known as a constant) represents a fixed data value.<br> 
The lit() function adds a constant (literal) value to a DataFrame as a new column.

In [0]:
employee_df.select("*",lit("Hindu").alias("Religion")).show()

+---+--------+---+------+------------+--------+--------+
| id|    name|age|salary|     address| nominee|Religion|
+---+--------+---+------+------------+--------+--------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|   Hindu|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|   Hindu|
|  3|Pratisha| 17| 20000|     Kolkata|   India|   Hindu|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|   Hindu|
|  5|  Vikash| 31| 30000|        null|nominee5|   Hindu|
+---+--------+---+------+------------+--------+--------+



##### withColumn()

The withColumn() function in PySpark adds a new column or replaces an existing one (including changing its data type).<br>
It returns a new DataFrame; the original DataFrame is left unchanged.

The syntax for the withColumn() function is as follows:<br>
**df = df.withColumn(column_name, expression)**

column_name is the name of the new column, or of the existing column to replace, and<br>
expression is the Column expression used to populate it. A constant value must be wrapped<br>
in lit(), because withColumn() expects a Column rather than a plain Python value.

*Remember: it takes exactly two arguments, as the syntax shows.*

In [0]:
# Adding a new column using withColumn()
employee_df.withColumn("last_name",lit("Singh")).show()

+---+--------+---+------+------------+--------+---------+
| id|    name|age|salary|     address| nominee|last_name|
+---+--------+---+------+------------+--------+---------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|    Singh|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|    Singh|
|  3|Pratisha| 17| 20000|     Kolkata|   India|    Singh|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|    Singh|
|  5|  Vikash| 31| 30000|        null|nominee5|    Singh|
+---+--------+---+------+------------+--------+---------+



##### withColumnRenamed()
The withColumnRenamed() method in PySpark renames an existing column and returns a new DataFrame.

In [0]:
# Changing column name using withColumnRenamed
new_employee_df = employee_df.withColumnRenamed("id","employee_id")

# Displaying the new employee dataframe with changed name
new_employee_df.show()

+-----------+--------+---+------+------------+--------+
|employee_id|    name|age|salary|     address| nominee|
+-----------+--------+---+------+------------+--------+
|          1|  Soumya| 23| 15000|      Odisha|nominee1|
|          2| Jyotsna| 23| 19000|      Mumbai|nominee2|
|          3|Pratisha| 17| 20000|     Kolkata|   India|
|          4|  Pritam| 22|100000|Uttarpradesh|   India|
|          5|  Vikash| 31| 30000|        null|nominee5|
+-----------+--------+---+------+------------+--------+



In [0]:
# The original DataFrame is unchanged: withColumnRenamed() returned a new DataFrame
employee_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



In [0]:
# Casting column data types.
# Because withColumn() can replace an existing column, we can use it
# to change a column's data type.
employee_df.withColumn("id",col("id").cast("string"))\
           .withColumn("salary",col("salary").cast("double"))\
           .printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



In [0]:
# Casting can also be written as a SQL expression using expr()
employee_df_expr = employee_df.withColumn("id", expr("CAST(id AS long)")) \
                         .withColumn("age", expr("CAST(age AS string)"))

employee_df_expr.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



###### Removing the columns

In [0]:
# We can use the drop() method to remove columns from the DataFrame.
# Column names are passed as strings here, which works across PySpark versions.
employee_df.drop("address","nominee").show()

+---+--------+---+------+
| id|    name|age|salary|
+---+--------+---+------+
|  1|  Soumya| 23| 15000|
|  2| Jyotsna| 23| 19000|
|  3|Pratisha| 17| 20000|
|  4|  Pritam| 22|100000|
|  5|  Vikash| 31| 30000|
+---+--------+---+------+



### SparkSQL
Let's repeat the transformations above using Spark SQL.

In [0]:
# Displaying the employee details whose salary is more than 20000 and age is less than 30
spark.sql("""
SELECT * 
FROM employee_tbl
WHERE salary > 20000 and age < 30
""").show()

+---+------+---+------+------------+-------+
| id|  name|age|salary|     address|nominee|
+---+------+---+------+------------+-------+
|  4|Pritam| 22|100000|Uttarpradesh|  India|
+---+------+---+------+------------+-------+



In [0]:
# Add a literal value as a new column
spark.sql("""
SELECT *,'kumar' as last_name 
FROM employee_tbl
""").show()


+---+--------+---+------+------------+--------+---------+
| id|    name|age|salary|     address| nominee|last_name|
+---+--------+---+------+------------+--------+---------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|    kumar|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|    kumar|
|  3|Pratisha| 17| 20000|     Kolkata|   India|    kumar|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|    kumar|
|  5|  Vikash| 31| 30000|        null|nominee5|    kumar|
+---+--------+---+------+------------+--------+---------+



In [0]:
# Adding a new computed column
spark.sql("""
SELECT *,id + 5 as updated_id 
FROM employee_tbl
""").show()

+---+--------+---+------+------------+--------+----------+
| id|    name|age|salary|     address| nominee|updated_id|
+---+--------+---+------+------------+--------+----------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|         6|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|         7|
|  3|Pratisha| 17| 20000|     Kolkata|   India|         8|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|         9|
|  5|  Vikash| 31| 30000|        null|nominee5|        10|
+---+--------+---+------+------------+--------+----------+



In [0]:
# Renaming the column
spark.sql("""
SELECT id as employee_id,name,age,salary,address,nominee 
FROM employee_tbl
""").show()

+-----------+--------+---+------+------------+--------+
|employee_id|    name|age|salary|     address| nominee|
+-----------+--------+---+------+------------+--------+
|          1|  Soumya| 23| 15000|      Odisha|nominee1|
|          2| Jyotsna| 23| 19000|      Mumbai|nominee2|
|          3|Pratisha| 17| 20000|     Kolkata|   India|
|          4|  Pritam| 22|100000|Uttarpradesh|   India|
|          5|  Vikash| 31| 30000|        null|nominee5|
+-----------+--------+---+------+------------+--------+



In [0]:
# Cast the data type of a column (the alias keeps the column name as salary)
spark.sql("""
SELECT id,name,age,cast(salary as double) as salary,address,nominee
FROM employee_tbl
""").printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



In [0]:
# To remove a column, leave it out of the SELECT list; it will not be present in the resulting DataFrame
updated_emp_df = spark.sql("""
SELECT id,name,age,salary,address 
FROM employee_tbl
""")

# As we can see, the nominee column is not present in this new DataFrame
updated_emp_df.show()

+---+--------+---+------+------------+
| id|    name|age|salary|     address|
+---+--------+---+------+------------+
|  1|  Soumya| 23| 15000|      Odisha|
|  2| Jyotsna| 23| 19000|      Mumbai|
|  3|Pratisha| 17| 20000|     Kolkata|
|  4|  Pritam| 22|100000|Uttarpradesh|
|  5|  Vikash| 31| 30000|        null|
+---+--------+---+------+------------+

