# Exercise - Get details of inactive customers

- Data is available in HDFS under /public/retail_db
- Source directories: /public/retail_db/orders and /public/retail_db/customers
- Source delimiter: comma (“,”)
- Source Columns - orders - order_id, order_date, order_customer_id, order_status
- Source Columns - customers - customer_id, customer_fname, customer_lname and many more
- Get the customers who have not placed any orders, sorted by customer_lname and then customer_fname
- Target Columns: customer_lname, customer_fname
- Number of files - 1
- Target Directory: /user/YOUR_USER_ID/solutions/solutions02/inactive_customers
- Target File Format: csv
- Target Delimiter: comma (“,”)
- Compression: N/A
- Validate the results
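
The requirement above amounts to an anti-join: keep the customers whose customer_id never appears as order_customer_id, then sort by last name and first name. As a minimal sketch in plain Python, using made-up rows (the IDs and names below are hypothetical and not taken from the data set), the logic looks roughly like this:

```python
# Hypothetical sample rows; the real data lives under /public/retail_db.
# Layout: (customer_id, customer_fname, customer_lname)
customers = [
    (1, "Mary", "Smith"),
    (2, "Ann", "Jones"),
    (3, "Carl", "Smith"),
    (4, "Albert", "Ellison"),
]
# Layout: (order_id, order_date, order_customer_id, order_status)
orders = [
    (101, "2013-07-25 00:00:00.0", 2, "CLOSED"),
    (102, "2013-07-25 00:00:00.0", 2, "COMPLETE"),
]

# Customer IDs that placed at least one order.
active_ids = {order_customer_id for _, _, order_customer_id, _ in orders}

# Keep customers with no orders as (customer_lname, customer_fname),
# sorted by last name and then first name.
inactive = sorted(
    (lname, fname)
    for customer_id, fname, lname in customers
    if customer_id not in active_ids
)

for lname, fname in inactive:
    print(f"{lname},{fname}")
```

The Spark solution further down follows the same idea, with a join taking the place of the set lookup.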

In [22]:
%%bash

hadoop fs -ls /public/retail_db

Found 6 items
drwxr-xr-x   - hdfs hdfs          0 2016-12-19 03:52 /public/retail_db/categories
drwxr-xr-x   - hdfs hdfs          0 2016-12-19 03:52 /public/retail_db/customers
drwxr-xr-x   - hdfs hdfs          0 2016-12-19 03:52 /public/retail_db/departments
drwxr-xr-x   - hdfs hdfs          0 2020-07-12 16:50 /public/retail_db/order_items
drwxr-xr-x   - hdfs hdfs          0 2020-07-14 01:35 /public/retail_db/orders
drwxr-xr-x   - hdfs hdfs          0 2016-12-19 03:52 /public/retail_db/products


In [23]:
%%bash

hadoop fs -ls /public/retail_db/orders

Found 1 items
-rw-r--r--   2 hdfs hdfs    2999944 2020-07-14 01:35 /public/retail_db/orders/part-00000


In [27]:
%%bash

hadoop fs -cat /public/retail_db/orders/part-00000 | head

#order_id, order_date, order_customer_id, order_status

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT


cat: Unable to write to output stream.


In [28]:
%%bash

hadoop fs -cat /public/retail_db/customers/part-00000 | head

#customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode

1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521
2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126
3,Ann,Smith,XXXXXXXXX,XXXXXXXXX,3422 Blue Pioneer Bend,Caguas,PR,00725
4,Mary,Jones,XXXXXXXXX,XXXXXXXXX,8324 Little Common,San Marcos,CA,92069
5,Robert,Hudson,XXXXXXXXX,XXXXXXXXX,"10 Crystal River Mall ",Caguas,PR,00725
6,Mary,Smith,XXXXXXXXX,XXXXXXXXX,3151 Sleepy Quail Promenade,Passaic,NJ,07055
7,Melissa,Wilcox,XXXXXXXXX,XXXXXXXXX,9453 High Concession,Caguas,PR,00725
8,Megan,Smith,XXXXXXXXX,XXXXXXXXX,3047 Foggy Forest Plaza,Lawrence,MA,01841
9,Mary,Perez,XXXXXXXXX,XXXXXXXXX,3616 Quaking Street,Caguas,PR,00725
10,Melissa,Smith,XXXXXXXXX,XXXXXXXXX,8598 Harvest Beacon Plaza,Stafford,VA,22554


cat: Unable to write to output stream.


In [29]:
from pyspark.sql import SparkSession

from pyspark.sql.functions import *

In [30]:
spark = (SparkSession
         .builder
         .config('spark.ui.port', 0)
         .appName('InactiveCustomers')
         .master('yarn')
         .getOrCreate()
)

In [31]:
ordersSchema = "`order_id` INT, `order_date` STRING, `order_customer_id` INT, `order_status` STRING"

customersSchema = """`customer_id` INT , `customer_fname` STRING, `customer_lname` STRING, `customer_email` STRING, `customer_password` STRING, `customer_street` STRING, `customer_city` STRING, `customer_state` STRING, `customer_zipcode` INT"""

In [37]:
orders_df = (spark
          .read
          .schema(ordersSchema)
          .csv("/public/retail_db/orders/")
)

orders_df.show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [38]:
customers_df = (spark
          .read
          .schema(customersSchema)
          .csv("/public/retail_db/customers/")
)

customers_df.show()

+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|customer_id|customer_fname|customer_lname|customer_email|customer_password|     customer_street|customer_city|customer_state|customer_zipcode|
+-----------+--------------+--------------+--------------+-----------------+--------------------+-------------+--------------+----------------+
|          1|       Richard|     Hernandez|     XXXXXXXXX|        XXXXXXXXX|  6303 Heather Plaza|  Brownsville|            TX|           78521|
|          2|          Mary|       Barrett|     XXXXXXXXX|        XXXXXXXXX|9526 Noble Embers...|    Littleton|            CO|           80126|
|          3|           Ann|         Smith|     XXXXXXXXX|        XXXXXXXXX|3422 Blue Pioneer...|       Caguas|            PR|             725|
|          4|          Mary|         Jones|     XXXXXXXXX|        XXXXXXXXX|  8324 Little Common|   San Marcos|            CA|          
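
One side effect of declaring `customer_zipcode` as INT is visible in the output above: the raw value 00725 for customer 3 is displayed as 725. A small plain-Python illustration of why this happens (the padding step assumes every value is a 5-digit US ZIP code, which the source does not state):

```python
raw_zipcode = "00725"          # value as it appears in the customers file

as_int = int(raw_zipcode)      # integer parsing drops the leading zeros
as_str = raw_zipcode           # keeping it as text preserves them

print(as_int)                  # 725
print(as_str)                  # 00725

# If the column has already been read as an integer, the zeros can be
# padded back, ASSUMING every value is a 5-digit ZIP code.
restored = str(as_int).zfill(5)
print(restored)                # 00725
```

This exercise writes only customer_lname and customer_fname, so the result is unaffected. Declaring `customer_zipcode` as STRING in `customersSchema` would matter only if the ZIP code were part of the output.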

In [46]:
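# A left outer join keeps every customer. Customers without a matching order
# get NULL in all order columns, so the join key is tested for NULL.
customerWithNoOrders_df = (customers_df
                      .join(orders_df, customers_df.customer_id == orders_df.order_customer_id, 'left')
                      .filter(orders_df.order_customer_id.isNull())
                     )

# Keep only the target columns and sort by last name, then first name.
customerWithNoOrders_df = (customerWithNoOrders_df
                      .select(customers_df.customer_lname, customers_df.customer_fname)
                      .sort("customer_lname", "customer_fname")
                     )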

customerWithNoOrders_df.show()

+--------------+--------------+
|customer_lname|customer_fname|
+--------------+--------------+
|        Bolton|          Mary|
|       Ellison|        Albert|
|         Green|       Carolyn|
|        Greene|          Mary|
|       Harrell|          Mary|
|         Lewis|          Mary|
|       Mueller|          Mary|
|         Patel|       Matthew|
|          Shaw|          Mary|
|         Smith|        Amanda|
|         Smith|        Ashley|
|         Smith|          Carl|
|         Smith|          Emma|
|         Smith|         Grace|
|         Smith|         James|
|         Smith|          Joan|
|         Smith|       Kenneth|
|         Smith|         Kevin|
|         Smith|          Mary|
|         Smith|          Mary|
+--------------+--------------+
only showing top 20 rows
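
The same result can likely be obtained without a NULL filter by using Spark's `left_anti` join type, which returns only the rows of the left DataFrame that have no match on the right. A minimal sketch, assuming `customers_df` and `orders_df` as defined earlier (the function name `get_inactive_customers` is made up for illustration):

```python
def get_inactive_customers(customers_df, orders_df):
    """Return customers with no orders, sorted by last name, then first name.

    Both arguments are expected to be Spark DataFrames with the columns
    described at the top of this exercise.
    """
    return (customers_df
            # left_anti keeps only customers with no matching order row
            .join(orders_df,
                  on=customers_df["customer_id"] == orders_df["order_customer_id"],
                  how="left_anti")
            .select("customer_lname", "customer_fname")
            .sort("customer_lname", "customer_fname"))

# Usage in this notebook (not executed here):
# get_inactive_customers(customers_df, orders_df).show()
```

Because an anti-join never carries columns from the orders side into the result, there is no need to decide which order column to test for NULL.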



In [54]:
(customerWithNoOrders_df
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("header", "True")
  .csv("/user/ranga_rao/solutions/solutions02/inactive_customers")
)

In [55]:
%%bash

hadoop fs -ls "/user/ranga_rao/solutions/solutions02/inactive_customers"

Found 2 items
-rw-r--r--   2 ranga_rao hdfs          0 2020-09-11 05:37 /user/ranga_rao/solutions/solutions02/inactive_customers/_SUCCESS
-rw-r--r--   2 ranga_rao hdfs        402 2020-09-11 05:37 /user/ranga_rao/solutions/solutions02/inactive_customers/part-00000-e5f5a0f2-3026-4256-97c1-916b9c8aee74-c000.csv


In [57]:
%%bash

hadoop fs -cat "/user/ranga_rao/solutions/solutions02/inactive_customers/part-00000-e5f5a0f2-3026-4256-97c1-916b9c8aee74-c000.csv" | head

customer_lname,customer_fname
Bolton,Mary
Ellison,Albert
Green,Carolyn
Greene,Mary
Harrell,Mary
Lewis,Mary
Mueller,Mary
Patel,Matthew
Shaw,Mary


In [59]:
customerWithNoOrders_verification_df = (spark
          .read
          .option("header", "True")
          .option("inferSchema", "True")
          .csv("/user/ranga_rao/solutions/solutions02/inactive_customers")
)

customerWithNoOrders_verification_df.show()

+--------------+--------------+
|customer_lname|customer_fname|
+--------------+--------------+
|        Bolton|          Mary|
|       Ellison|        Albert|
|         Green|       Carolyn|
|        Greene|          Mary|
|       Harrell|          Mary|
|         Lewis|          Mary|
|       Mueller|          Mary|
|         Patel|       Matthew|
|          Shaw|          Mary|
|         Smith|        Amanda|
|         Smith|        Ashley|
|         Smith|          Carl|
|         Smith|          Emma|
|         Smith|         Grace|
|         Smith|         James|
|         Smith|          Joan|
|         Smith|       Kenneth|
|         Smith|         Kevin|
|         Smith|          Mary|
|         Smith|          Mary|
+--------------+--------------+
only showing top 20 rows



In [61]:
customerWithNoOrders_verification_df.count()

30
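
Beyond the row count, the "Validate the results" step could also confirm the sort order. A minimal sketch in plain Python, using the first nine rows printed from the output file above. In the notebook, the rows could instead come from `customerWithNoOrders_verification_df.collect()`, which is an assumption about how one might wire this up rather than part of the original solution:

```python
# First nine data rows of the output file, as (customer_lname, customer_fname).
rows = [
    ("Bolton", "Mary"),
    ("Ellison", "Albert"),
    ("Green", "Carolyn"),
    ("Greene", "Mary"),
    ("Harrell", "Mary"),
    ("Lewis", "Mary"),
    ("Mueller", "Mary"),
    ("Patel", "Matthew"),
    ("Shaw", "Mary"),
]

# Tuples compare element by element, so this checks last name first
# and first name second, matching the required sort order.
is_sorted = rows == sorted(rows)
print(is_sorted)

# In the notebook (not executed here):
# rows = [tuple(r) for r in customerWithNoOrders_verification_df.collect()]
```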
