<a href="https://colab.research.google.com/github/Shivayogi-A/Pyspark_programming/blob/master/Misc_transformations_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [71]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8 (headless JDK).
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()

# Create a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark


In [72]:
data_list = [('Shiva', '17','11', '97'),
             ('Vaishnavi', '16','03','98'),
             ('Suraj', '15','06','1998'),
             ('Shubham', '19','08', '1998'),
             ('Shiva','17','11','1997')]

raw_df = spark.createDataFrame(data_list)

In [73]:
raw_df.show()

+---------+---+---+----+
|       _1| _2| _3|  _4|
+---------+---+---+----+
|    Shiva| 17| 11|  97|
|Vaishnavi| 16| 03|  98|
|    Suraj| 15| 06|1998|
|  Shubham| 19| 08|1998|
|    Shiva| 17| 11|1997|
+---------+---+---+----+



**pyspark.sql.DataFrame.toDF**

DataFrame.toDF(*cols: str)
Returns a new DataFrame with the specified column names.

In [74]:
new_df =  raw_df.toDF('Name','Date','Month','Year').repartition(3)
new_df.printSchema()
new_df.show()

root
 |-- Name: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Year: string (nullable = true)

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|    Suraj|  15|   06|1998|
|Vaishnavi|  16|   03|  98|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|  97|
|    Shiva|  17|   11|1997|
+---------+----+-----+----+



**CASE Statement**

A CASE clause evaluates its conditions in order and returns the result of the first one that matches, similar to if/else statements in other programming languages.
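
To make the if/else analogy concrete, here is a minimal plain-Python sketch of the same two-digit-year rule that the CASE expression in the next cell applies. The function name `expand_year` is hypothetical, and it takes integers, so it sidesteps the string-typing question that the Spark version has to deal with.

```python
def expand_year(year: int) -> int:
    """Expand a two-digit year using the same branches as the CASE expression."""
    if year < 21:        # CASE WHEN Year < 21 THEN Year + 2000
        return year + 2000
    elif year < 100:     # WHEN Year < 100 THEN Year + 1900
        return year + 1900
    else:                # ELSE Year
        return year


for raw in (7, 97, 98, 1998):
    print(raw, "->", expand_year(raw))
```

As in SQL, the branches are checked top to bottom and the first match wins, so the order of the two conditions matters.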

In [75]:
from pyspark.sql.functions import col, expr

df1 = new_df.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END"""))
df1.show()

+---------+----+-----+------+
|     Name|Date|Month|  Year|
+---------+----+-----+------+
|    Suraj|  15|   06|  1998|
|Vaishnavi|  16|   03|1998.0|
|  Shubham|  19|   08|  1998|
|    Shiva|  17|   11|1997.0|
|    Shiva|  17|   11|  1997|
+---------+----+-----+------+



The Year column is a string, but we are performing arithmetic on it. Because of this, Spark implicitly casts the value to a double for the addition. The ELSE branch still returns the original string, so the result of the CASE expression is converted back to a string once the calculation is complete, and the decimal point remains (for example, 1998.0).
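
As a rough analogy in plain Python (not Spark's actual implementation), the round trip looks like this: the string is parsed as a floating-point number for the addition, and the result is rendered back to text, which is where the trailing `.0` comes from. The helper name `add_then_stringify` is made up for illustration.

```python
def add_then_stringify(year_text: str, offset: int) -> str:
    # Parse the string as a float so the addition can happen, then turn the
    # float result back into text, as happens to the CASE result in the cell above.
    return str(float(year_text) + offset)


print(add_then_stringify("98", 1900))  # 1998.0
print(str(int("98") + 1900))           # 1998, integer arithmetic leaves no decimal point
```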

There are two ways to fix this:
1. **Inline casting**: cast the column to the appropriate data type where the arithmetic is performed.





In [76]:
from pyspark.sql.functions import col, expr

df1 = new_df.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN cast(Year as int)+2000
              WHEN Year < 100 THEN cast(Year as int)+1900
              ELSE Year
              END"""))
df1.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|    Suraj|  15|   06|1998|
|Vaishnavi|  16|   03|1998|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|1997|
|    Shiva|  17|   11|1997|
+---------+----+-----+----+



You can use the method above, or cast the result of the entire expression to IntegerType as shown below.

In [77]:
from pyspark.sql.types import IntegerType

# Cast the result of the whole CASE expression to an integer.
df1 = new_df.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END""").cast(IntegerType()))
df1.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|    Suraj|  15|   06|1998|
|Vaishnavi|  16|   03|1998|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|1997|
|    Shiva|  17|   11|1997|
+---------+----+-----+----+




2. **Change the schema**: cast the columns to the appropriate data types, or provide the data types when the DataFrame is created.
Changing the schema is a design choice: first check that nothing downstream depends on the existing column types.

In this scenario, changing the schema is perfectly fine.
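
The alternative mentioned in point 2, supplying the types when the DataFrame is created, could look roughly like the sketch below. It assumes the `spark` session and `data_list` from the earlier cells; `SCHEMA_DDL`, `to_typed_rows`, and `make_typed_df` are hypothetical names. The string values are converted to Python integers first, because `createDataFrame` checks the rows against the declared schema.

```python
# DDL-style schema string: column names and types declared up front.
SCHEMA_DDL = "Name string, Date int, Month int, Year int"


def to_typed_rows(rows):
    """Convert the day, month, and year fields from strings to integers."""
    return [(name, int(day), int(month), int(year)) for name, day, month, year in rows]


def make_typed_df(spark, rows):
    """Build a DataFrame whose numeric columns are integers from the start."""
    return spark.createDataFrame(to_typed_rows(rows), schema=SCHEMA_DDL)


sample = [("Shiva", "17", "11", "97"), ("Suraj", "15", "06", "1998")]
print(to_typed_rows(sample))
```

With the notebook's session, `make_typed_df(spark, data_list).printSchema()` should then report Date, Month, and Year as integer columns, so no later cast is needed.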

In [78]:
# Cast the three numeric columns from string to integer.
df2 = (new_df.withColumn('Date', col('Date').cast(IntegerType()))
             .withColumn('Month', col('Month').cast(IntegerType()))
             .withColumn('Year', col('Year').cast(IntegerType())))


df2.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|    Suraj|  15|    6|1998|
|Vaishnavi|  16|    3|  98|
|  Shubham|  19|    8|1998|
|    Shiva|  17|   11|  97|
|    Shiva|  17|   11|1997|
+---------+----+-----+----+



In [79]:
df3 = df2.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END"""))
df3.show()


+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|    Suraj|  15|    6|1998|
|Vaishnavi|  16|    3|1998|
|  Shubham|  19|    8|1998|
|    Shiva|  17|   11|1997|
|    Shiva|  17|   11|1997|
+---------+----+-----+----+



**Note: It is always recommended to cast column data types explicitly to avoid unexpected behaviour.**

Next, concatenate the three columns (Date, Month, Year) into a single column to get the date of birth.

We can use the to_date function for this.

In [80]:
df4 = df3.withColumn('DOB', expr("to_date(concat(Date, '/',Month, '/', Year),'d/M/y')"))
df4.show()

+---------+----+-----+----+----------+
|     Name|Date|Month|Year|       DOB|
+---------+----+-----+----+----------+
|    Suraj|  15|    6|1998|1998-06-15|
|Vaishnavi|  16|    3|1998|1998-03-16|
|  Shubham|  19|    8|1998|1998-08-19|
|    Shiva|  17|   11|1997|1997-11-17|
|    Shiva|  17|   11|1997|1997-11-17|
+---------+----+-----+----+----------+
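
For a quick sanity check of the resulting dates outside Spark, the same day/month/year assembly can be sketched with Python's standard `datetime` module. This is only a cross-check under the assumption that the three parts are already integers, as they are in `df3`; the function name `build_dob` is hypothetical.

```python
from datetime import date, datetime


def build_dob(day: int, month: int, year: int) -> date:
    # Join the parts with '/' and parse them back, mirroring
    # to_date(concat(Date, '/', Month, '/', Year), 'd/M/y').
    text = f"{day}/{month}/{year}"
    return datetime.strptime(text, "%d/%m/%Y").date()


print(build_dob(15, 6, 1998))   # 1998-06-15
print(build_dob(17, 11, 1997))  # 1997-11-17
```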



Now that we have the date of birth, the Date, Month, and Year columns are redundant, so we can remove them with the drop function.


In [81]:
df5 = df4.drop('Date','Month','Year')
df5.show()

+---------+----------+
|     Name|       DOB|
+---------+----------+
|    Suraj|1998-06-15|
|Vaishnavi|1998-03-16|
|  Shubham|1998-08-19|
|    Shiva|1997-11-17|
|    Shiva|1997-11-17|
+---------+----------+



The results above contain duplicate rows. We can remove them with the **dropDuplicates()** function.

**dropDuplicates()**: Returns a new DataFrame with duplicate rows removed,
optionally only considering certain columns.

In [83]:
df6 = df5.dropDuplicates()
df6.show()

+---------+----------+
|     Name|       DOB|
+---------+----------+
|    Suraj|1998-06-15|
|    Shiva|1997-11-17|
|  Shubham|1998-08-19|
|Vaishnavi|1998-03-16|
+---------+----------+
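
The "optionally only considering certain columns" part is not exercised above. In PySpark it corresponds to passing a list of column names, for example `df5.dropDuplicates(['Name'])`. The plain-Python sketch below illustrates the difference between the two modes; `drop_duplicates` is a hypothetical helper, and it keeps the first row seen per key, whereas Spark does not promise which of the duplicate rows survives.

```python
def drop_duplicates(rows, key_indexes=None):
    """Remove duplicate rows, comparing whole rows or only the given positions."""
    seen = set()
    result = []
    for row in rows:
        # Compare the whole row by default, or only the selected fields.
        key = row if key_indexes is None else tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    ("Suraj", "1998-06-15"),
    ("Shiva", "1997-11-17"),
    ("Shiva", "1997-11-17"),  # exact duplicate
    ("Shiva", "2001-01-01"),  # made-up row: same name, different date
]
print(drop_duplicates(rows))                   # drops only the exact duplicate
print(drop_duplicates(rows, key_indexes=[0]))  # one row per name
```

Deduplicating on a subset of columns is useful when one column identifies a record, but it silently discards the differing values in the other columns, so it is worth checking which rows disagree first.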

