<a href="https://colab.research.google.com/github/Shivayogi-A/Pyspark_programming/blob/master/Misc_transformations_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8.
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()

# Create a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Ign:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:7 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,208 kB]
Get:14 https://r2u.stat.illinois.ed

In [None]:
data_list = [('Shiva', '17','11', '97'),
             ('Vaishnavi', '16','03','98'),
             ('Suraj', '15','06','1998'),
             ('Shubham', '19','08', '1998')]

raw_df = spark.createDataFrame(data_list)

In [None]:
raw_df.show()

+---------+---+---+----+
|       _1| _2| _3|  _4|
+---------+---+---+----+
|    Shiva| 17| 11|  97|
|Vaishnavi| 16| 03|  98|
|    Suraj| 15| 06|1998|
|  Shubham| 19| 08|1998|
+---------+---+---+----+



**pyspark.sql.DataFrame.toDF**

DataFrame.toDF(*cols: str)
Returns a new DataFrame with the specified column names.

In [None]:
new_df =  raw_df.toDF('Name','Date','Month','Year').repartition(3)
new_df.printSchema()
new_df.show()

root
 |-- Name: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Year: string (nullable = true)

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|Vaishnavi|  16|   03|  98|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|  97|
|    Suraj|  15|   06|1998|
+---------+----+-----+----+



**CASE Statement**

A CASE clause evaluates its conditions in order and returns the result of the first one that matches, similar to if/else statements in other programming languages.
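For comparison, the year rule used in the next cell can be sketched as a plain Python if/elif/else chain. The function name `normalize_year` is hypothetical and only illustrates the branching order; it is not part of PySpark.

```python
def normalize_year(year: int) -> int:
    """Expand a two-digit year; branches are checked top to bottom, like CASE WHEN."""
    if year < 21:
        return year + 2000   # 0..20 are treated as 2000..2020
    elif year < 100:
        return year + 1900   # 21..99 are treated as 1921..1999
    else:
        return year          # already a four-digit year


years = [normalize_year(y) for y in (97, 98, 1998, 5)]
print(years)  # [1997, 1998, 1998, 2005]
```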

In [None]:
from pyspark.sql.functions import col, expr

df1 = new_df.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END"""))
df1.show()

+---------+----+-----+------+
|     Name|Date|Month|  Year|
+---------+----+-----+------+
|Vaishnavi|  16|   03|1998.0|
|  Shubham|  19|   08|  1998|
|    Shiva|  17|   11|1997.0|
|    Suraj|  15|   06|  1998|
+---------+----+-----+------+



The Year column is a string, but we are performing an arithmetic operation on it. Because of this, Spark implicitly casts the value to a double for the addition. The ELSE branch still returns the original string, so the CASE expression as a whole resolves to a string and the computed values are converted back to strings. The decimal point remains, which is why some rows show 1998.0.

There are two ways to fix this:
1. **Inline casting**: cast the column to the appropriate data type inside the expression that performs the arithmetic.

In [None]:
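from pyspark.sql.functions import col, expr

# Cast Year to int only in the arithmetic branches. The ELSE branch is still a
# string, so the column stays a string, but without the trailing ".0".
df1 = new_df.withColumn("Year",expr("""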
              CASE WHEN Year < 21 THEN cast(Year as int)+2000
              WHEN Year < 100 THEN cast(Year as int)+1900
              ELSE Year
              END"""))
df1.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|Vaishnavi|  16|   03|1998|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|1997|
|    Suraj|  15|   06|1998|
+---------+----+-----+----+



Alternatively, you can cast the entire CASE expression to IntegerType, as shown below. In that case the Year column itself becomes an integer column.

In [None]:
from pyspark.sql.types import IntegerType

# Cast the result of the whole CASE expression, so Year becomes an integer column.
df1 = new_df.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END""").cast(IntegerType()))
df1.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|Vaishnavi|  16|   03|1998|
|  Shubham|  19|   08|1998|
|    Shiva|  17|   11|1997|
|    Suraj|  15|   06|1998|
+---------+----+-----+----+




2. **Change the schema**: cast the columns to appropriate data types, or provide the data types when the DataFrame is created.
Changing the schema is a design choice, so first check that nothing else depends on the current column types.

In the given scenario, changing the schema is perfectly fine.

In [None]:
df2 = new_df.withColumn('Date',col('Date').cast(IntegerType()))\
            .withColumn('Month',col('Month').cast(IntegerType()))\
            .withColumn('Year',col('Year').cast(IntegerType()))


df2.show()

+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|Vaishnavi|  16|    3|  98|
|  Shubham|  19|    8|1998|
|    Shiva|  17|   11|  97|
|    Suraj|  15|    6|1998|
+---------+----+-----+----+



In [None]:
df3 = df2.withColumn("Year",expr("""
              CASE WHEN Year < 21 THEN Year+2000
              WHEN Year < 100 THEN Year+1900
              ELSE Year
              END"""))
df3.show()


+---------+----+-----+----+
|     Name|Date|Month|Year|
+---------+----+-----+----+
|Vaishnavi|  16|    3|1998|
|  Shubham|  19|    8|1998|
|    Shiva|  17|   11|1997|
|    Suraj|  15|    6|1998|
+---------+----+-----+----+



**Note: It is always recommended to cast columns to explicit data types, so that implicit type coercion does not cause unexpected behaviour in the code.**