## RDD vs DataFrame

### RDD
    Unstructured and structured data
    Java serializer

### DataFrame (aka SchemaRDD)
    Tabular format
    SQL supported
    Schema file support
    Catalyst optimizer
    Java serializer / Kryo serializer

## RDD to DataFrame

In [1]:
from pyspark.sql import SparkSession

appName = "BDP RDD2DF"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [5]:
spark

sc = spark.sparkContext

In [6]:
sc

In [7]:
li = ["Bhagya", "Dinesh", "Gowri", "Karthik", "Mohamed", "Naveen", "Ramanadhan", "Singaravel"]

rdd = sc.parallelize(li)

In [8]:
type(rdd)

pyspark.rdd.RDD

In [9]:
rdd.collect()

['Bhagya',
 'Dinesh',
 'Gowri',
 'Karthik',
 'Mohamed',
 'Naveen',
 'Ramanadhan',
 'Singaravel']

In [10]:
rdd.count()

8

In [12]:
rdd1 = rdd.map(lambda x: (x[0], x))

In [13]:
rdd1.collect()

[('B', 'Bhagya'),
 ('D', 'Dinesh'),
 ('G', 'Gowri'),
 ('K', 'Karthik'),
 ('M', 'Mohamed'),
 ('N', 'Naveen'),
 ('R', 'Ramanadhan'),
 ('S', 'Singaravel')]

In [14]:
from pyspark.sql import Row

In [22]:
rdd2 = rdd.map(lambda x: Row(first=x[0], actual=x))

In [23]:
rdd2.collect()

[Row(first='B', actual='Bhagya'),
 Row(first='D', actual='Dinesh'),
 Row(first='G', actual='Gowri'),
 Row(first='K', actual='Karthik'),
 Row(first='M', actual='Mohamed'),
 Row(first='N', actual='Naveen'),
 Row(first='R', actual='Ramanadhan'),
 Row(first='S', actual='Singaravel')]

In [24]:
df = spark.createDataFrame(rdd2)

In [25]:
df.show()

+-----+----------+
|first|    actual|
+-----+----------+
|    B|    Bhagya|
|    D|    Dinesh|
|    G|     Gowri|
|    K|   Karthik|
|    M|   Mohamed|
|    N|    Naveen|
|    R|Ramanadhan|
|    S|Singaravel|
+-----+----------+

# Spark ORC Dataset

In [73]:
!hdfs dfs -mkdir -p /user/bigdatapedia/input/order/orc

In [74]:
!hdfs dfs -put /home/bigdatapedia/data/neworders.snappy.orc /user/bigdatapedia/input/order/orc/

In [75]:
!hdfs dfs -ls -h /user/bigdatapedia/input/order/orc/

Found 1 items
-rw-r--r--   3 bigdatapedia supergroup    181.5 K 2025-03-22 05:17 /user/bigdatapedia/input/order/orc/neworders.snappy.orc


DataFrame[order_id: int, order_date: timestamp, order_customer_id: int, order_status: string]

In [76]:
df_order = spark.read.orc("/user/bigdatapedia/input/order/orc")

In [77]:
df_order.show(5,0)

+--------+-------------------+-----------------+---------------+
|order_id|order_date         |order_customer_id|order_status   |
+--------+-------------------+-----------------+---------------+
|1       |2013-07-25 00:00:00|11599            |CLOSED         |
|2       |2013-07-25 00:00:00|256              |PENDING_PAYMENT|
|3       |2013-07-25 00:00:00|12111            |COMPLETE       |
|4       |2013-07-25 00:00:00|8827             |CLOSED         |
|5       |2013-07-25 00:00:00|11318            |COMPLETE       |
+--------+-------------------+-----------------+---------------+
only showing top 5 rows

# Spark Parquet Dataset

In [28]:
!hdfs dfs -mkdir -p /user/bigdatapedia/input/customer/parquet

In [29]:
!hdfs dfs -put /home/bigdatapedia/data/customer_parq.parquet /user/bigdatapedia/input/customer/parquet/

In [30]:
!hdfs dfs -ls -h /user/bigdatapedia/input/customer/parquet/

Found 1 items
-rw-r--r--   3 bigdatapedia supergroup    248.7 K 2025-03-22 04:38 /user/bigdatapedia/input/customer/parquet/customer_parq.parquet


In [31]:
df_cust = spark.read.parquet("/user/bigdatapedia/input/customer/parquet")

DataFrame[customer_id: int, customer_fname: string, customer_lname: string, customer_email: string, customer_password: string, customer_street: string, customer_city: string, customer_state: string, customer_zipcode: string]

In [32]:
df_cust.show()

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
|          4|          Mary|         Jones|     XXXXXXXXX|        XXXXXXXXX|  8324 Little Common|   San Marcos|            CA|          

                                                                                

# DF Transformation/Action

## Transformation
    Narrow Transformation (No Shuffling)
    Wide Transformation (Shuffling)
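The distinction above can be modelled in plain Python, without Spark. This is only an illustrative sketch: the `partitions` list and the helpers `narrow_filter` and `wide_group_count` are hypothetical names, and real Spark moves rows between executors rather than inside one process. A narrow transformation handles each partition on its own, while a wide one has to bring rows with the same key together first.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (city, state) rows.
partitions = [
    [("Brownsville", "TX"), ("Littleton", "CO")],
    [("Caguas", "PR"), ("San Antonio", "TX")],
]

def narrow_filter(parts, state):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if row[1] == state] for part in parts]

def wide_group_count(parts, num_out=3):
    """Wide: rows are redistributed (shuffled) by key before counting."""
    shuffled = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for _city, state in part:
            # A simple deterministic stand-in for a hash partitioner.
            target = sum(map(ord, state)) % num_out
            shuffled[target][state] += 1
    return [dict(d) for d in shuffled]

filtered = narrow_filter(partitions, "TX")
counts = wide_group_count(partitions)
print(filtered)
print(counts)
```

In the filter, the number and layout of partitions is unchanged. In the group count, both `TX` rows end up in the same output partition even though they started in different ones.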

## Narrow Transformation

#### Filter

In [33]:
df_filter = df_cust.filter("customer_state = 'TX'")

In [34]:
df_filter.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+--------------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street           |customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------------+-------------+--------------+----------------+
|1          |Richard       |Hernandez     |XXXXXXXXX     |XXXXXXXXX        |6303 Heather Plaza        |Brownsville  |TX            |78521           |
|12         |Christopher   |Smith         |XXXXXXXXX     |XXXXXXXXX        |5594 Jagged Embers By-pass|San Antonio  |TX            |78227           |
|29         |Mary          |Humphrey      |XXXXXXXXX     |XXXXXXXXX        |2469 Blue Brook Crossing  |Fort Worth   |TX            |76133           |
|82         |Jonathan      |Cook          |XXXXXXXXX     |XXXXXXXXX        |7885 Sleepy Cove        

                                                                                

#### withColumn

In [42]:
from pyspark.sql.functions import concat_ws

In [48]:
df_wc = df_cust.withColumn("cust_fullname", concat_ws(" ", df_cust["customer_fname"], df_cust["customer_lname"]))

In [49]:
df_wc.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+-----------------------+-------------+--------------+----------------+-----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street        |customer_city|customer_state|customer_zipcode|cust_fullname    |
+-----------+--------------+--------------+--------------+-----------------+-----------------------+-------------+--------------+----------------+-----------------+
|1          |Richard       |Hernandez     |XXXXXXXXX     |XXXXXXXXX        |6303 Heather Plaza     |Brownsville  |TX            |78521           |Richard Hernandez|
|2          |Mary          |Barrett       |XXXXXXXXX     |XXXXXXXXX        |9526 Noble Embers Ridge|Littleton    |CO            |80126           |Mary Barrett     |
|3          |Ann           |Smith         |XXXXXXXXX     |XXXXXXXXX        |3422 Blue Pioneer Bend |Caguas       |PR            |00725           |Ann Smith        |
|4        

#### Select

In [50]:
df_select = df_cust.select("customer_id", "customer_fname", "customer_lname", "customer_city", "customer_state")

In [51]:
df_select.show(5,0)

+-----------+--------------+--------------+-------------+--------------+
|customer_id|customer_fname|customer_lname|customer_city|customer_state|
+-----------+--------------+--------------+-------------+--------------+
|1          |Richard       |Hernandez     |Brownsville  |TX            |
|2          |Mary          |Barrett       |Littleton    |CO            |
|3          |Ann           |Smith         |Caguas       |PR            |
|4          |Mary          |Jones         |San Marcos   |CA            |
|5          |Robert        |Hudson        |Caguas       |PR            |
+-----------+--------------+--------------+-------------+--------------+
only showing top 5 rows



#### withColumnRenamed

In [57]:
df_wcr = df_cust.withColumnRenamed("customer_fname", "first_name")

In [58]:
df_wcr.show(5,0)

+-----------+----------+--------------+--------------+-----------------+-----------------------+-------------+--------------+----------------+
|customer_id|first_name|customer_lname|customer_email|customer_password|customer_street        |customer_city|customer_state|customer_zipcode|
+-----------+----------+--------------+--------------+-----------------+-----------------------+-------------+--------------+----------------+
|1          |Richard   |Hernandez     |XXXXXXXXX     |XXXXXXXXX        |6303 Heather Plaza     |Brownsville  |TX            |78521           |
|2          |Mary      |Barrett       |XXXXXXXXX     |XXXXXXXXX        |9526 Noble Embers Ridge|Littleton    |CO            |80126           |
|3          |Ann       |Smith         |XXXXXXXXX     |XXXXXXXXX        |3422 Blue Pioneer Bend |Caguas       |PR            |00725           |
|4          |Mary      |Jones         |XXXXXXXXX     |XXXXXXXXX        |8324 Little Common     |San Marcos   |CA            |92069           |

#### Where

In [59]:
df_where = df_cust.where("customer_state = 'CA'")

In [60]:
df_where.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+--------------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street           |customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------------+-------------+--------------+----------------+
|4          |Mary          |Jones         |XXXXXXXXX     |XXXXXXXXX        |8324 Little Common        |San Marcos   |CA            |92069           |
|14         |Katherine     |Smith         |XXXXXXXXX     |XXXXXXXXX        |5666 Hazy Pony Square     |Pico Rivera  |CA            |90660           |
|15         |Jane          |Luna          |XXXXXXXXX     |XXXXXXXXX        |673 Burning Glen          |Fontana      |CA            |92336           |
|18         |Robert        |Smith         |XXXXXXXXX     |XXXXXXXXX        |2734 Hazy Butterfly Circ

#### Drop

In [61]:
df_drop = df_cust.drop("customer_email", "customer_password")

In [62]:
df_drop.show(5,0)

+-----------+--------------+--------------+-----------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_street        |customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+-----------------------+-------------+--------------+----------------+
|1          |Richard       |Hernandez     |6303 Heather Plaza     |Brownsville  |TX            |78521           |
|2          |Mary          |Barrett       |9526 Noble Embers Ridge|Littleton    |CO            |80126           |
|3          |Ann           |Smith         |3422 Blue Pioneer Bend |Caguas       |PR            |00725           |
|4          |Mary          |Jones         |8324 Little Common     |San Marcos   |CA            |92069           |
|5          |Robert        |Hudson        |10 Crystal River Mall  |Caguas       |PR            |00725           |
+-----------+--------------+--------------+-----------------------+-------------+-------

#### Schema

In [63]:
df_cust.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- customer_fname: string (nullable = true)
 |-- customer_lname: string (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- customer_password: string (nullable = true)
 |-- customer_street: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_zipcode: string (nullable = true)



#### Types

In [65]:
df_cust.dtypes

[('customer_id', 'int'),
 ('customer_fname', 'string'),
 ('customer_lname', 'string'),
 ('customer_email', 'string'),
 ('customer_password', 'string'),
 ('customer_street', 'string'),
 ('customer_city', 'string'),
 ('customer_state', 'string'),
 ('customer_zipcode', 'string')]

In [66]:
df_cust.dtypes[0]

('customer_id', 'int')

#### Cast

In [68]:
from pyspark.sql.types import StringType, IntegerType

In [71]:
df_cust.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- customer_fname: string (nullable = true)
 |-- customer_lname: string (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- customer_password: string (nullable = true)
 |-- customer_street: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_zipcode: string (nullable = true)



In [69]:
df_cast = df_cust.select("customer_id", "customer_fname", "customer_lname", "customer_city", "customer_state", 
                         df_cust["customer_zipcode"].cast(IntegerType()))

In [70]:
df_cast.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- customer_fname: string (nullable = true)
 |-- customer_lname: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_zipcode: integer (nullable = true)



In [72]:
df_cast.show(5,0)

+-----------+--------------+--------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+-------------+--------------+----------------+
|1          |Richard       |Hernandez     |Brownsville  |TX            |78521           |
|2          |Mary          |Barrett       |Littleton    |CO            |80126           |
|3          |Ann           |Smith         |Caguas       |PR            |725             |
|4          |Mary          |Jones         |San Marcos   |CA            |92069           |
|5          |Robert        |Hudson        |Caguas       |PR            |725             |
+-----------+--------------+--------------+-------------+--------------+----------------+
only showing top 5 rows
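
Note that `00725` became `725` in the output above: an integer has no leading zeros, so casting a ZIP code to an integer type loses information. The plain-Python sketch below reproduces the effect and shows one way to restore the padding. The sample list `zip_strings` is hypothetical, and the five-digit width is an assumption about US ZIP codes.

```python
# Hypothetical sample of ZIP codes stored as strings.
zip_strings = ["78521", "80126", "00725"]

# Converting to int drops the leading zeros, like the cast above.
as_ints = [int(z) for z in zip_strings]

# Padding back to five digits recovers the original text form.
restored = [str(n).zfill(5) for n in as_ints]

print(as_ints)
print(restored)
```

Keeping such identifier-like columns as strings avoids the problem altogether.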



#### Union / UnionAll

In [91]:
df_limit.show()  # df_limit = df_cust.limit(3), created in the Limit section (In [55])

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------

In [92]:
df_union = df_limit.union(df_limit)

In [93]:
df_union.show()

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|          

In [94]:
# unionAll behaves the same as union: duplicate rows are kept
df_unionall = df_limit.unionAll(df_limit)

In [95]:
df_unionall.show()

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|          
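
In both outputs above the rows of `df_limit` start repeating: `union` and `unionAll` append rows without removing duplicates (SQL `UNION ALL` semantics), and columns are matched by position rather than by name. A rough plain-Python model, using a hypothetical list of tuples in place of a DataFrame:

```python
# Hypothetical three-row sample standing in for df_limit.
rows = [(1, "Richard", "TX"), (2, "Mary", "CO"), (3, "Ann", "PR")]

# union / unionAll: plain concatenation, duplicates are kept.
unioned = rows + rows

# Removing duplicates is a separate step (distinct() in Spark).
# dict.fromkeys keeps first-seen order here; Spark gives no order guarantee.
deduplicated = list(dict.fromkeys(unioned))

print(len(unioned))
print(deduplicated)
```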

#### Fillna

In [113]:
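# `+` is numeric addition, not string concatenation: the name columns cannot be
# read as numbers, so cust_fullname comes out null (handy for the fill demo below)
df_wc = df_cust.select("customer_id", "customer_fname", "customer_lname", "customer_city", "customer_state") \
                .withColumn("cust_fullname",  (df_cust["customer_fname"]+df_cust["customer_lname"]))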

In [128]:
df_wc.show(5,0)

+-----------+--------------+--------------+-------------+--------------+-------------+
|customer_id|customer_fname|customer_lname|customer_city|customer_state|cust_fullname|
+-----------+--------------+--------------+-------------+--------------+-------------+
|1          |Richard       |Hernandez     |Brownsville  |TX            |null         |
|2          |Mary          |Barrett       |Littleton    |CO            |null         |
|3          |Ann           |Smith         |Caguas       |PR            |null         |
|4          |Mary          |Jones         |San Marcos   |CA            |null         |
|5          |Robert        |Hudson        |Caguas       |PR            |null         |
+-----------+--------------+--------------+-------------+--------------+-------------+
only showing top 5 rows



In [134]:
df_fill = df_wc.na.fill({'cust_fullname': 0, 'customer_state': 'TX1'})

In [135]:
df_fill.show(5,0)

+-----------+--------------+--------------+-------------+--------------+-------------+
|customer_id|customer_fname|customer_lname|customer_city|customer_state|cust_fullname|
+-----------+--------------+--------------+-------------+--------------+-------------+
|1          |Richard       |Hernandez     |Brownsville  |TX            |0.0          |
|2          |Mary          |Barrett       |Littleton    |CO            |0.0          |
|3          |Ann           |Smith         |Caguas       |PR            |0.0          |
|4          |Mary          |Jones         |San Marcos   |CA            |0.0          |
|5          |Robert        |Hudson        |Caguas       |PR            |0.0          |
+-----------+--------------+--------------+-------------+--------------+-------------+
only showing top 5 rows
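
Two details in the output above are worth spelling out. `cust_fullname` is a numeric column (the result of `+`), which is why the fill value `0` is displayed as `0.0`. `customer_state` is unchanged because it contained no nulls to fill. The sketch below models only the fill step in plain Python, with rows as hypothetical dictionaries and `None` standing in for null; `fill_nulls` is an illustrative helper, not a Spark function.

```python
def fill_nulls(rows, defaults):
    """Replace None with the per-column default; leave other values alone."""
    return [
        {
            col: defaults[col] if val is None and col in defaults else val
            for col, val in row.items()
        }
        for row in rows
    ]

# Hypothetical rows: cust_fullname is null everywhere, customer_state never is.
rows = [
    {"customer_state": "TX", "cust_fullname": None},
    {"customer_state": "CO", "cust_fullname": None},
]

filled = fill_nulls(rows, {"cust_fullname": 0.0, "customer_state": "TX1"})
print(filled)
```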



## Wide Transformation

#### Group By

In [36]:
df_group = df_cust.groupby("customer_city").count()

In [37]:
df_group.show(10,0)

+---------------+-----+
|customer_city  |count|
+---------------+-----+
|Hanover        |9    |
|Caguas         |4584 |
|Corona         |25   |
|Tempe          |35   |
|Bowling Green  |8    |
|Springfield    |3    |
|Lawrenceville  |12   |
|North Las Vegas|12   |
|Palatine       |8    |
|Phoenix        |64   |
+---------------+-----+
only showing top 10 rows



#### Limit

In [55]:
df_limit = df_cust.limit(3)

In [56]:
df_limit.show()

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------

#### Join - Inner

In [78]:
df_cust.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- customer_fname: string (nullable = true)
 |-- customer_lname: string (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- customer_password: string (nullable = true)
 |-- customer_street: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_zipcode: string (nullable = true)



In [79]:
df_order.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [82]:
df_join = df_cust.join(df_order, df_cust["customer_id"] == df_order["order_customer_id"], "inner")

In [83]:
df_join.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street |customer_city|customer_state|customer_zipcode|order_id|order_date         |order_customer_id|order_status|
+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |15061   |2013-10-28 00:00:00|148              |CLOSED      |
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |59569   |2013-10-03 00:00:00|148              |COMPLETE    |


                                                                                

#### Join - Left

In [85]:
df_leftjoin = df_cust.join(df_order, df_cust["customer_id"] == df_order["order_customer_id"], "left")

In [86]:
df_leftjoin.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street |customer_city|customer_state|customer_zipcode|order_id|order_date         |order_customer_id|order_status|
+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |15061   |2013-10-28 00:00:00|148              |CLOSED      |
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |59569   |2013-10-03 00:00:00|148              |COMPLETE    |


                                                                                

#### Join - Right

In [87]:
df_rightjoin = df_cust.join(df_order, df_cust["customer_id"] == df_order["order_customer_id"], "right")

In [88]:
df_rightjoin.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street |customer_city|customer_state|customer_zipcode|order_id|order_date         |order_customer_id|order_status|
+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |15061   |2013-10-28 00:00:00|148              |CLOSED      |
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |59569   |2013-10-03 00:00:00|148              |COMPLETE    |


                                                                                

#### Join - Full

In [89]:
df_fulljoin = df_cust.join(df_order, df_cust["customer_id"] == df_order["order_customer_id"], "full")

In [90]:
df_fulljoin.show(5,0)

+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|customer_street |customer_city|customer_state|customer_zipcode|order_id|order_date         |order_customer_id|order_status|
+-----------+--------------+--------------+--------------+-----------------+----------------+-------------+--------------+----------------+--------+-------------------+-----------------+------------+
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |15061   |2013-10-28 00:00:00|148              |CLOSED      |
|148        |Stephanie     |Richards      |XXXXXXXXX     |XXXXXXXXX        |245 Lost Way    |Caguas       |PR            |00725           |59569   |2013-10-03 00:00:00|148              |COMPLETE    |
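
The first rows shown look the same for all four joins because customer 148 has matching orders. The join types differ only in which unmatched rows they keep. A plain-Python sketch with hypothetical customers and orders makes this visible; `join` is an illustrative helper and `None` stands in for null.

```python
# Hypothetical data: customer 2 has no orders, order 12 has no known customer.
customers = {1: "Richard", 2: "Mary"}      # customer_id -> name
orders = [(10, 1), (11, 1), (12, 3)]       # (order_id, order_customer_id)

def join(customers, orders, how="inner"):
    """Return (customer_id, name, order_id) tuples for the given join type."""
    out = []
    matched = set()
    for order_id, cust_id in orders:
        if cust_id in customers:
            out.append((cust_id, customers[cust_id], order_id))
            matched.add(cust_id)
        elif how in ("right", "full"):
            # Order without a customer: kept only by right/full joins.
            out.append((None, None, order_id))
    if how in ("left", "full"):
        # Customers without orders: kept only by left/full joins.
        for cust_id, name in customers.items():
            if cust_id not in matched:
                out.append((cust_id, name, None))
    return out

for how in ("inner", "left", "right", "full"):
    print(how, join(customers, orders, how))
```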


                                                                                

#### Distinct

In [96]:
df_distinct = df_unionall.distinct()

In [97]:
df_distinct.show()



+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------

                                                                                

#### DropDuplicates

In [104]:
df_dedup = df_unionall.drop_duplicates(["customer_state"])

In [139]:
df_dedup.show()



+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------
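
`distinct()` compares whole rows, while `drop_duplicates([...])` compares only the listed columns and keeps one row per distinct value. In the sample above all three states differ, so both calls return three rows. The plain-Python sketch below uses a hypothetical sample where the difference shows. Note that this model keeps the first row it sees, whereas Spark does not guarantee which of the duplicates survives.

```python
# Hypothetical rows: (customer_id, first name, state); one exact duplicate,
# and two different customers sharing the state "TX".
rows = [
    (1, "Richard", "TX"),
    (12, "Christopher", "TX"),
    (2, "Mary", "CO"),
    (1, "Richard", "TX"),
]

def distinct(rows):
    """Drop rows that are identical in every column."""
    return list(dict.fromkeys(rows))

def drop_duplicates(rows, key_index):
    """Keep one row per distinct value of the column at key_index."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key_index], row)
    return list(seen.values())

by_row = distinct(rows)
by_state = drop_duplicates(rows, key_index=2)
print(by_row)
print(by_state)
```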

                                                                                

#### Sort

In [136]:
df_distinct.show()



+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------

                                                                                

In [137]:
df_sort = df_distinct.sort("customer_id")

In [138]:
df_sort.show()



+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|           00725|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------

                                                                                