* collect()
* sample()
* fill() and fillna()
* pivot() and unpivot()
* Create PySpark ArrayType
* map()
* explode()
* split()
* MapType()
* when()
* lit()
* split()
* concat_ws()
* substring()
* regexp_replace()

In [1]:
import findspark
findspark.init()
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

from pyspark.sql.types import StructField, StructType, StringType, MapType
from pyspark.sql.functions import col,lit,udf,expr,map_keys,explode,map_values,when,concat,split,concat_ws,substring,translate

In [2]:
spark=SparkSession.builder.appName("Train_Hard").getOrCreate()

# Collect

* PySpark RDD/DataFrame collect() is an action that retrieves all elements of the dataset (from all nodes) to the driver node. Use collect() on smaller datasets, usually after filter(), groupBy(), etc. Collecting a large dataset can cause an OutOfMemory error on the driver.

In [6]:
dept = [("Finance",10), \
    ("Marketing",20), \
    ("Sales",30), \
    ("IT",40) \
  ]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.show()

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|  Finance|     10|
|Marketing|     20|
|    Sales|     30|
|       IT|     40|
+---------+-------+



In [8]:
dta_collect=deptDF.collect()
print(dta_collect)

[Row(dept_name='Finance', dept_id=10), Row(dept_name='Marketing', dept_id=20), Row(dept_name='Sales', dept_id=30), Row(dept_name='IT', dept_id=40)]


* Note that collect() is an action, so it does not return a DataFrame. Instead, it returns the data to the driver as a Python list of Row objects. Once the data is in a list, you can use a Python for loop to process it further.

In [9]:
for i in dta_collect:
    print(i["dept_name"])

Finance
Marketing
Sales
IT


## Sample
* PySpark sampling (pyspark.sql.DataFrame.sample()) returns a random subset of the rows in a dataset. This is helpful when you have a large dataset and want to analyze or test a subset of the data, for example 10% of the original file.

In [6]:
df=spark.range(100)
df.show(3)

+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+
only showing top 3 rows



#### A fraction of 0.1 returns roughly 10% of the rows; the exact number of rows is not guaranteed.

In [11]:
df.sample(.1).count()


10

In [12]:
print(df.sample(.1).collect())

[Row(id=2), Row(id=6), Row(id=27), Row(id=38), Row(id=47), Row(id=53), Row(id=61), Row(id=67), Row(id=75), Row(id=77), Row(id=82), Row(id=94)]


#### Using seed to reproduce the same Samples in PySpark

In [13]:
print(df.sample(.1,seed=12).collect())

[Row(id=4), Row(id=13), Row(id=16), Row(id=22), Row(id=32), Row(id=39), Row(id=93), Row(id=94)]


In [14]:
print(df.sample(.1,seed=12).collect())

[Row(id=4), Row(id=13), Row(id=16), Row(id=22), Row(id=32), Row(id=39), Row(id=93), Row(id=94)]


#### Sample with replacement (withReplacement=True; may contain duplicates)

In [16]:
print(df.sample(True,.1,seed=12).collect())

[Row(id=5), Row(id=15), Row(id=16), Row(id=43), Row(id=62), Row(id=64), Row(id=64), Row(id=65), Row(id=74), Row(id=76), Row(id=80)]


# Fill() and Fillna()

In [20]:
df=spark.read.csv("test1.csv",header=True)
df.show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+



In [24]:
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- Experience: string (nullable = true)
 |-- Salary: string (nullable = true)



In [49]:
df=df.withColumn("age",(col("age").cast("int")))\
  .withColumn("Salary",(col("Salary").cast("int")))
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: string (nullable = true)
 |-- Salary: integer (nullable = true)



### Replace NULL/None Values with Zero (0)
* A numeric fill value replaces nulls only in numeric columns; string columns such as Name and Experience are left unchanged.

In [50]:
df.na.fill(value=0).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
|   Mahesh|  0|      null| 40000|
|     null| 34|        10| 38000|
|     null| 36|      null|     0|
+---------+---+----------+------+



#### Replace null values with " "

In [51]:
df.na.fill(" ").show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|          | 40000|
|         |  34|        10| 38000|
|         |  36|          |  null|
+---------+----+----------+------+



* Note: a string fill value is applied only to string columns.

#### Filling only a particular column

In [52]:
df.na.fill("missing",subset=["Name"]).show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|  missing|  34|        10| 38000|
|  missing|  36|      null|  null|
+---------+----+----------+------+



# fillna()
* DataFrame.fillna() is an alias of DataFrame.na.fill(), so it behaves the same way.

In [53]:
df.fillna(0).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
|   Mahesh|  0|      null| 40000|
|     null| 34|        10| 38000|
|     null| 36|      null|     0|
+---------+---+----------+------+



In [55]:
df.fillna("missing",subset=["Experience"]).show()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|   missing| 40000|
|     null|  34|        10| 38000|
|     null|  36|   missing|  null|
+---------+----+----------+------+



### Pivot
* PySpark pivot() rotates the distinct values of one column into separate DataFrame columns; the reverse operation (columns back into rows) is called unpivot.
* It is an aggregation in which the distinct values of one grouping column become individual columns.

In [56]:
data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"), \
      ("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
      ("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
      ("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]

columns= ["Product","Amount","Country"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

root
 |-- Product: string (nullable = true)
 |-- Amount: long (nullable = true)
 |-- Country: string (nullable = true)

+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
|Banana |1000  |USA    |
|Carrots|1500  |USA    |
|Beans  |1600  |USA    |
|Orange |2000  |USA    |
|Orange |2000  |USA    |
|Banana |400   |China  |
|Carrots|1200  |China  |
|Beans  |1500  |China  |
|Orange |4000  |China  |
|Banana |2000  |Canada |
|Carrots|2000  |Canada |
|Beans  |2000  |Mexico |
+-------+------+-------+



In [59]:
df.groupBy(["Product","country"]).sum("Amount").show()

+-------+-------+-----------+
|Product|country|sum(Amount)|
+-------+-------+-----------+
|Carrots|    USA|       1500|
| Banana|    USA|       1000|
|  Beans|    USA|       1600|
| Banana|  China|        400|
| Orange|    USA|       4000|
| Orange|  China|       4000|
|Carrots|  China|       1200|
|  Beans|  China|       1500|
|Carrots| Canada|       2000|
|  Beans| Mexico|       2000|
| Banana| Canada|       2000|
+-------+-------+-----------+



In [57]:
pi=df.groupBy("Product").pivot("country").sum("Amount")
pi.show()

+-------+------+-----+------+----+
|Product|Canada|China|Mexico| USA|
+-------+------+-----+------+----+
| Orange|  null| 4000|  null|4000|
|  Beans|  null| 1500|  2000|1600|
| Banana|  2000|  400|  null|1000|
|Carrots|  2000| 1200|  null|1500|
+-------+------+-----+------+----+



In [61]:
## rearranging columns
countries = ["USA","China","Canada","Mexico"]
pivotDF =df.groupBy("Product").pivot("country",countries).sum("Amount")
pivotDF.show()

+-------+----+-----+------+------+
|Product| USA|China|Canada|Mexico|
+-------+----+-----+------+------+
| Orange|4000| 4000|  null|  null|
|  Beans|1600| 1500|  null|  2000|
| Banana|1000|  400|  2000|  null|
|Carrots|1500| 1200|  2000|  null|
+-------+----+-----+------+------+



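## Unpivot()
* The example below unpivots with the SQL stack() function inside expr(). It lists only Canada, China, and Mexico, so the USA column is not part of the output.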

In [69]:
unpivotExpr = "stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"
unPivotDF = pivotDF.select("Product", expr(unpivotExpr)) \
    .where("Total is not null")
unPivotDF.show(truncate=False)
unPivotDF.show()

+-------+-------+-----+
|Product|Country|Total|
+-------+-------+-----+
|Orange |China  |4000 |
|Beans  |China  |1500 |
|Beans  |Mexico |2000 |
|Banana |Canada |2000 |
|Banana |China  |400  |
|Carrots|Canada |2000 |
|Carrots|China  |1200 |
+-------+-------+-----+

+-------+-------+-----+
|Product|Country|Total|
+-------+-------+-----+
| Orange|  China| 4000|
|  Beans|  China| 1500|
|  Beans| Mexico| 2000|
| Banana| Canada| 2000|
| Banana|  China|  400|
|Carrots| Canada| 2000|
|Carrots|  China| 1200|
+-------+-------+-----+



## Create PySpark ArrayType

In [3]:

data = [
 ("James,,Smith",["Java","Scala","C++"],["Spark","Java"],"OH","CA"),
 ("Michael,Rose,",["Spark","Java","C++"],["Spark","Java"],"NY","NJ"),
 ("Robert,,Williams",["CSharp","VB"],["Spark","Python"],"UT","NV")
]

from pyspark.sql.types import StringType, ArrayType,StructType,StructField
schema = StructType([ 
    StructField("name",StringType(),True), 
    StructField("languagesAtSchool",ArrayType(StringType()),True), 
    StructField("languagesAtWork",ArrayType(StringType()),True), 
    StructField("currentState", StringType(), True), 
    StructField("previousState", StringType(), True)
  ])

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)

+----------------+------------------+---------------+------------+-------------+
|            name| languagesAtSchool|languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
|    James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|
|   Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|
|Robert,,Williams|      [CSharp, VB]|[Spark, Python]|          UT|           NV|
+----------------+------------------+---------------+------------+-------------+



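## Changing column names
* Assigning to df.columns does not work in PySpark: the attribute is read-only, so the cell below raises an AttributeError.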

In [5]:
df.columns=["c1","c2","c3","c4","c5"]
df.show()

AttributeError: can't set attribute

### Using map()

In [8]:
#rdd=spark.sparkContext.parallelize(data)
cols=["c1","c2","c3","c4","c5"]
ndd=df.rdd.map(lambda x: (x[0],x[1],x[2],x[3],x[4]))
df2=ndd.toDF(cols)
df2.show()

+----------------+------------------+---------------+---+---+
|              c1|                c2|             c3| c4| c5|
+----------------+------------------+---------------+---+---+
|    James,,Smith|[Java, Scala, C++]|  [Spark, Java]| OH| CA|
|   Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]| NY| NJ|
|Robert,,Williams|      [CSharp, VB]|[Spark, Python]| UT| NV|
+----------------+------------------+---------------+---+---+



# Explode

In [9]:
from pyspark.sql.functions import explode
df.select(df.name,explode(df.languagesAtSchool)).show()

+----------------+------+
|            name|   col|
+----------------+------+
|    James,,Smith|  Java|
|    James,,Smith| Scala|
|    James,,Smith|   C++|
|   Michael,Rose,| Spark|
|   Michael,Rose,|  Java|
|   Michael,Rose,|   C++|
|Robert,,Williams|CSharp|
|Robert,,Williams|    VB|
+----------------+------+



# split()

In [10]:
from pyspark.sql.functions import split
df.select(split(df.name,",").alias("nameAsArray")).show()

+--------------------+
|         nameAsArray|
+--------------------+
|    [James, , Smith]|
|   [Michael, Rose, ]|
|[Robert, , Williams]|
+--------------------+



# array()
* Use the array() function to create a new array column by merging the data from multiple columns.

In [14]:
from pyspark.sql.functions import array
df.select(df.name,array("currentState","previousState").alias("New_col")).show()

+----------------+--------+
|            name| New_col|
+----------------+--------+
|    James,,Smith|[OH, CA]|
|   Michael,Rose,|[NY, NJ]|
|Robert,,Williams|[UT, NV]|
+----------------+--------+



## array_contains() 
* array_contains() is a SQL function that checks whether an array column contains a value. It returns null if the array is null, true if the array contains the value, and false otherwise.

In [15]:

from pyspark.sql.functions import array_contains
df.select(df.name,array_contains(df.languagesAtSchool,"Java")
    .alias("array_contains")).show()

+----------------+--------------+
|            name|array_contains|
+----------------+--------------+
|    James,,Smith|          true|
|   Michael,Rose,|          true|
|Robert,,Williams|         false|
+----------------+--------------+



## MapType()
* PySpark MapType (also called map type) is a data type that represents a Python dictionary (dict) and stores key-value pairs.

* The first parameter, keyType, specifies the type of the keys in the map.
* The second parameter, valueType, specifies the type of the values in the map.
* The third parameter, valueContainsNull, is an optional boolean that specifies whether the values can be Null/None.

In [8]:
schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(),StringType()),True)
])
dd = [('James',{'hair':'black','eye':'brown'}),
        ('Michael',{'hair':'brown','eye':None}),
        ('Robert',{'hair':'red','eye':'black'}),
        ('Washington',{'hair':'grey','eye':'grey'}),
        ('Jefferson',{'hair':'brown','eye':''})]
df=spark.createDataFrame(dd,schema)
df.show(truncate=False)

+----------+-----------------------------+
|name      |properties                   |
+----------+-----------------------------+
|James     |{eye -> brown, hair -> black}|
|Michael   |{eye -> null, hair -> brown} |
|Robert    |{eye -> black, hair -> red}  |
|Washington|{eye -> grey, hair -> grey}  |
|Jefferson |{eye -> , hair -> brown}     |
+----------+-----------------------------+



## Access PySpark MapType Elements

In [11]:
df.withColumn("eye",df.properties.getItem("eye"))\
.withColumn("hair",df.properties.getItem("hair")).show()

+----------+--------------------+-----+-----+
|      name|          properties|  eye| hair|
+----------+--------------------+-----+-----+
|     James|{eye -> brown, ha...|brown|black|
|   Michael|{eye -> null, hai...| null|brown|
|    Robert|{eye -> black, ha...|black|  red|
|Washington|{eye -> grey, hai...| grey| grey|
| Jefferson|{eye -> , hair ->...|     |brown|
+----------+--------------------+-----+-----+



In [12]:
df.withColumn("eye",df.properties.getItem("eye"))\
.withColumn("hair",df.properties.getItem("hair"))\
.drop("properties").show()

+----------+-----+-----+
|      name|  eye| hair|
+----------+-----+-----+
|     James|brown|black|
|   Michael| null|brown|
|    Robert|black|  red|
|Washington| grey| grey|
| Jefferson|     |brown|
+----------+-----+-----+



#### using explode()

In [14]:
df.select(df.name,explode(df.properties)).show()

+----------+----+-----+
|      name| key|value|
+----------+----+-----+
|     James| eye|brown|
|     James|hair|black|
|   Michael| eye| null|
|   Michael|hair|brown|
|    Robert| eye|black|
|    Robert|hair|  red|
|Washington| eye| grey|
|Washington|hair| grey|
| Jefferson| eye|     |
| Jefferson|hair|brown|
+----------+----+-----+



### map_keys() 

In [17]:
df.select(df.name,map_keys(df.properties)).show()

+----------+--------------------+
|      name|map_keys(properties)|
+----------+--------------------+
|     James|         [eye, hair]|
|   Michael|         [eye, hair]|
|    Robert|         [eye, hair]|
|Washington|         [eye, hair]|
| Jefferson|         [eye, hair]|
+----------+--------------------+



### map_values()

In [20]:
df.select(df.name,map_values(df.properties).alias("New_Col")).show()

+----------+--------------+
|      name|       New_Col|
+----------+--------------+
|     James|[brown, black]|
|   Michael| [null, brown]|
|    Robert|  [black, red]|
|Washington|  [grey, grey]|
| Jefferson|     [, brown]|
+----------+--------------+



# When

In [5]:
data = [("James","M",60000),("Michael","M",70000),
        ("Robert",None,400000),("Maria","F",500000),
        ("Jen","",None)]

columns = ["name","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()

+-------+------+------+
|   name|gender|salary|
+-------+------+------+
|  James|     M| 60000|
|Michael|     M| 70000|
| Robert|  null|400000|
|  Maria|     F|500000|
|    Jen|      |  null|
+-------+------+------+



In [8]:
df.withColumn("new_age",when(df.gender=="M","Male")
             .when(df.gender=="F","Female")
             .otherwise(df.gender)).show()

+-------+------+------+-------+
|   name|gender|salary|new_age|
+-------+------+------+-------+
|  James|     M| 60000|   Male|
|Michael|     M| 70000|   Male|
| Robert|  null|400000|   null|
|  Maria|     F|500000| Female|
|    Jen|      |  null|       |
+-------+------+------+-------+



In [11]:
df.withColumn("new_age",when(df.gender=="M","Male")
             .when(df.gender=="F","Female")
             .when(df.gender.isNull(),"missing")
             .otherwise(df.gender)).show()

+-------+------+------+-------+
|   name|gender|salary|new_age|
+-------+------+------+-------+
|  James|     M| 60000|   Male|
|Michael|     M| 70000|   Male|
| Robert|  null|400000|missing|
|  Maria|     F|500000| Female|
|    Jen|      |  null|       |
+-------+------+------+-------+



## Using case when

In [16]:
df.withColumn("new_gen",expr("CASE WHEN gender = 'M' THEN 'Male' " + 
               "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
               "ELSE gender END")).show()

+-------+------+------+-------+
|   name|gender|salary|new_gen|
+-------+------+------+-------+
|  James|     M| 60000|   Male|
|Michael|     M| 70000|   Male|
| Robert|  null|400000|       |
|  Maria|     F|500000| Female|
|    Jen|      |  null|       |
+-------+------+------+-------+



### Multiple Conditions using & and | operator

In [18]:
df.withColumn("new_column",when((df.gender=="M") & (df.name=="James"),"ladka")
             .when((df.name=="Maria")| (df.gender=="F"),"ldki")
             .otherwise("Baccha")).show()

+-------+------+------+----------+
|   name|gender|salary|new_column|
+-------+------+------+----------+
|  James|     M| 60000|     ladka|
|Michael|     M| 70000|    Baccha|
| Robert|  null|400000|    Baccha|
|  Maria|     F|500000|      ldki|
|    Jen|      |  null|    Baccha|
+-------+------+------+----------+



# PySpark expr()
* PySpark expr() parses a SQL-like expression string into a Column. It lets you run SQL expressions on a DataFrame and pass an existing column value as an argument to PySpark built-in functions.

#### Concatenate Columns using expr()

In [3]:
data=[("James","Bond"),("Scott","Varsa")] 
df=spark.createDataFrame(data).toDF("col1","col2") 
df.withColumn("Name",expr("col1 ||','||col2")).show()

+-----+-----+-----------+
| col1| col2|       Name|
+-----+-----+-----------+
|James| Bond| James,Bond|
|Scott|Varsa|Scott,Varsa|
+-----+-----+-----------+



In [18]:
df.withColumn("Name",concat(col("col1"),col("col2"))).show()

+-----+-----+----------+
| col1| col2|      Name|
+-----+-----+----------+
|James| Bond| JamesBond|
|Scott|Varsa|ScottVarsa|
+-----+-----+----------+



### Using an Existing Column Value for Expression

In [20]:
data=[("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)] 
df=spark.createDataFrame(data).toDF("date","increment")
df.show()

+----------+---------+
|      date|increment|
+----------+---------+
|2019-01-23|        1|
|2019-06-24|        2|
|2019-09-20|        3|
+----------+---------+



In [21]:
df.select(col("*"),expr("add_months(date,increment)").alias("inc_data")).show()

+----------+---------+----------+
|      date|increment|  inc_data|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+



#### Arithmetic operations

In [22]:
df.select(col("*"),expr("increment +5 as new_increment")).show()

+----------+---------+-------------+
|      date|increment|new_increment|
+----------+---------+-------------+
|2019-01-23|        1|            6|
|2019-06-24|        2|            7|
|2019-09-20|        3|            8|
+----------+---------+-------------+



###  Using Filter with expr()

In [23]:
data=[(100,2),(200,3000),(500,500)] 
df=spark.createDataFrame(data).toDF("col1","col2")
df.show()

+----+----+
|col1|col2|
+----+----+
| 100|   2|
| 200|3000|
| 500| 500|
+----+----+



In [24]:
df.filter(expr("col1==col2")).show()

+----+----+
|col1|col2|
+----+----+
| 500| 500|
+----+----+



# lit
* The PySpark SQL function lit() adds a new column to a DataFrame by assigning a literal or constant value. (typedLit() belongs to the Scala API and is not available in pyspark.sql.functions.)

In [37]:
data = [("111",50000),("222",60000),("333",40000)]
columns= ["EmpId","Salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()

+-----+------+
|EmpId|Salary|
+-----+------+
|  111| 50000|
|  222| 60000|
|  333| 40000|
+-----+------+



#### usage of lit() function

In [27]:
df.select(col("*"),lit(1).alias("constant")).show()

+-----+------+--------+
|EmpId|Salary|constant|
+-----+------+--------+
|  111| 50000|       1|
|  222| 60000|       1|
|  333| 40000|       1|
+-----+------+--------+



#### Q: if Salary is >= 40000 and <= 50000 then 100, else 200

In [41]:
# Each comparison needs its own parentheses, because & binds more tightly than >= and <= in Python.
# First, check which rows satisfy the condition:
df.filter((col("Salary") >=40000) &( col("Salary") <= 50000)).show()

+-----+------+
|EmpId|Salary|
+-----+------+
|  111| 50000|
|  333| 40000|
+-----+------+



In [43]:
df.withColumn("new_val",when((col("Salary") >=40000) &( col("Salary") <= 50000),lit("100")).otherwise(lit("200"))).show()

+-----+------+-------+
|EmpId|Salary|new_val|
+-----+------+-------+
|  111| 50000|    100|
|  222| 60000|    200|
|  333| 40000|    100|
+-----+------+-------+



# Split()
* PySpark SQL provides the split() function to convert a delimiter-separated string into an array (StringType to ArrayType) column on a DataFrame.

In [44]:
data = [("James, A, Smith","2018","M",3000),
            ("Michael, Rose, Jones","2010","M",4000),
            ("Robert,K,Williams","2010","M",4000),
            ("Maria,Anne,Jones","2005","F",4000),
            ("Jen,Mary,Brown","2010","",-1)
            ]

columns=["name","dob_year","gender","salary"]
df=spark.createDataFrame(data,columns)
df.show()

+--------------------+--------+------+------+
|                name|dob_year|gender|salary|
+--------------------+--------+------+------+
|     James, A, Smith|    2018|     M|  3000|
|Michael, Rose, Jones|    2010|     M|  4000|
|   Robert,K,Williams|    2010|     M|  4000|
|    Maria,Anne,Jones|    2005|     F|  4000|
|      Jen,Mary,Brown|    2010|      |    -1|
+--------------------+--------+------+------+



In [50]:
df.select(split(col("name"),",").alias("Name_array")).show(truncate=False)

+------------------------+
|Name_array              |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+



In [51]:
df.select(split(col("name"),",")[0].alias("Name_array")).show(truncate=False)

+----------+
|Name_array|
+----------+
|James     |
|Michael   |
|Robert    |
|Maria     |
|Jen       |
+----------+



# concat_ws()
* concat_ws() takes a delimiter of your choice as the first argument, followed by one or more string or array columns (type Column).

In [56]:

columns = ["name","languagesAtSchool","currentState"]
data = [("James,,Smith",["Java","Scala","C++"],"CA"), \
    ("Michael,Rose,",["Spark","Java","C++"],"NJ"), \
    ("Robert,,Williams",["CSharp","VB"],"NV")]

df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)

+----------------+------------------+------------+
|name            |languagesAtSchool |currentState|
+----------------+------------------+------------+
|James,,Smith    |[Java, Scala, C++]|CA          |
|Michael,Rose,   |[Spark, Java, C++]|NJ          |
|Robert,,Williams|[CSharp, VB]      |NV          |
+----------------+------------------+------------+



### To convert an array to a string, PySpark SQL provides the built-in function concat_ws()

In [64]:
df.withColumn("languagesAtSchool",concat_ws(",",col("languagesAtSchool"))).show()

+----------------+-----------------+------------+
|            name|languagesAtSchool|currentState|
+----------------+-----------------+------------+
|    James,,Smith|   Java,Scala,C++|          CA|
|   Michael,Rose,|   Spark,Java,C++|          NJ|
|Robert,,Williams|        CSharp,VB|          NV|
+----------------+-----------------+------------+



# Substring()
* substring() extracts a substring from a DataFrame string column, given the start position (1-based) and the length of the substring you want to extract.

In [65]:

data = [(1,"20200828"),(2,"20180525")]
columns=["id","date"]
df=spark.createDataFrame(data,columns)
df.show()

+---+--------+
| id|    date|
+---+--------+
|  1|20200828|
|  2|20180525|
+---+--------+



In [79]:
df.withColumn("year",substring("date",1,4)) \
              .withColumn("month",substring("date",5,2))\
              .withColumn("day",substring(col("date"),7,2)).show()

+---+--------+----+-----+---+
| id|    date|year|month|day|
+---+--------+----+-----+---+
|  1|20200828|2020|   08| 28|
|  2|20180525|2018|   05| 25|
+---+--------+----+-----+---+



# Translate()
* With the translate() string function you can replace characters one by one in a DataFrame column value.

In [9]:
address = [(1,"14851 Jeffrey Rd","DE"),
    (2,"43421 Margarita St","NY"),
    (3,"13111 Siemon Ave","CA")]
df =spark.createDataFrame(address,["id","address","state"])
df.show()

+---+------------------+-----+
| id|           address|state|
+---+------------------+-----+
|  1|  14851 Jeffrey Rd|   DE|
|  2|43421 Margarita St|   NY|
|  3|  13111 Siemon Ave|   CA|
+---+------------------+-----+



In [7]:
df.withColumn("add",translate(col("address"),"123","ABC")).show()

+---+------------------+-----+------------------+
| id|           address|state|               add|
+---+------------------+-----+------------------+
|  1|  14851 Jeffrey Rd|   DE|  A485A Jeffrey Rd|
|  2|43421 Margarita St|   NY|4C4BA Margarita St|
|  3|  13111 Siemon Ave|   CA|  ACAAA Siemon Ave|
+---+------------------+-----+------------------+



# regexp_replace()
* regexp_replace() replaces every part of a string column value that matches a regular expression with another string.

In [10]:
from pyspark.sql.functions import regexp_replace
df.withColumn('address', regexp_replace("address","Rd","Road")).show()

+---+------------------+-----+
| id|           address|state|
+---+------------------+-----+
|  1|14851 Jeffrey Road|   DE|
|  2|43421 Margarita St|   NY|
|  3|  13111 Siemon Ave|   CA|
+---+------------------+-----+



In [12]:
df.withColumn('new_add', 
              when(df.address.endswith("Rd"),regexp_replace("address","Rd","Road")) \
             .when(df.address.endswith("St"),regexp_replace("address","St","Street"))\
             .when(df.address.endswith("Ave"),regexp_replace("address","Ave","Avenue"))).show(truncate=False)

+---+------------------+-----+----------------------+
|id |address           |state|new_add               |
+---+------------------+-----+----------------------+
|1  |14851 Jeffrey Rd  |DE   |14851 Jeffrey Road    |
|2  |43421 Margarita St|NY   |43421 Margarita Street|
|3  |13111 Siemon Ave  |CA   |13111 Siemon Avenue   |
+---+------------------+-----+----------------------+



# translate()
* translate() is a string function that replaces characters one by one in a DataFrame column value.
* In the example below, every 1 is replaced with A, every 2 with B, and every 3 with C in the address column.

In [13]:
from pyspark.sql.functions import translate
df.withColumn('adds', translate("address","123","ABC")).show()

+---+------------------+-----+------------------+
| id|           address|state|              adds|
+---+------------------+-----+------------------+
|  1|  14851 Jeffrey Rd|   DE|  A485A Jeffrey Rd|
|  2|43421 Margarita St|   NY|4C4BA Margarita St|
|  3|  13111 Siemon Ave|   CA|  ACAAA Siemon Ave|
+---+------------------+-----+------------------+



## Replace Column with Another Column Value

In [14]:
df = spark.createDataFrame(
   [("ABCDE_XYZ", "XYZ","FGH")], 
    ("col1", "col2","col3"))
df.show()

+---------+----+----+
|     col1|col2|col3|
+---------+----+----+
|ABCDE_XYZ| XYZ| FGH|
+---------+----+----+



In [16]:
df.withColumn("col4",regexp_replace("col1","col2","col3")).show()

+---------+----+----+---------+
|     col1|col2|col3|     col4|
+---------+----+----+---------+
|ABCDE_XYZ| XYZ| FGH|ABCDE_XYZ|
+---------+----+----+---------+



In [20]:
df.withColumn("col4",expr("regexp_replace(col1,col2,col3)")).show()

+---------+----+----+---------+
|     col1|col2|col3|     col4|
+---------+----+----+---------+
|ABCDE_XYZ| XYZ| FGH|ABCDE_FGH|
+---------+----+----+---------+



In [15]:
df.withColumn("col4",regexp_replace("col1","XYZ","FGH")).show()

+---------+----+----+---------+
|     col1|col2|col3|     col4|
+---------+----+----+---------+
|ABCDE_XYZ| XYZ| FGH|ABCDE_FGH|
+---------+----+----+---------+



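## Overlay()
* overlay(src, replace, pos) overwrites part of src with replace, starting at position pos (1-based). By default it overwrites as many characters as replace contains.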

In [3]:
df = spark.createDataFrame([("ABCDE_XYZ", "FGH")], ("col1", "col2"))
df.show()

+---------+----+
|     col1|col2|
+---------+----+
|ABCDE_XYZ| FGH|
+---------+----+



In [7]:
from pyspark.sql.functions import overlay
df.select(overlay("col1", "col2", 7).alias("overlayed")).show()

+---------+
|overlayed|
+---------+
|ABCDE_FGH|
+---------+



In [8]:
df.select(overlay("col1", "col2", 8).alias("overlayed")).show()

+----------+
| overlayed|
+----------+
|ABCDE_XFGH|
+----------+



In [9]:
df.select(overlay("col1", "col2", 9).alias("overlayed")).show()

+-----------+
|  overlayed|
+-----------+
|ABCDE_XYFGH|
+-----------+



In [10]:
df.select(overlay("col1", "col2", 10).alias("overlayed")).show()

+------------+
|   overlayed|
+------------+
|ABCDE_XYZFGH|
+------------+



In [11]:
df.select(overlay("col1", "col2", 12).alias("overlayed")).show()

+------------+
|   overlayed|
+------------+
|ABCDE_XYZFGH|
+------------+

