### Overview of the Business
#### A small restaurant sells some fantastic Indian food items and needs our help to stay afloat. The restaurant has captured some data from its few months of operation but has yet to learn how to use that data to run the business.
#### We want to use the data to answer a few questions about its customers: their visiting patterns, how much money they have spent, and which menu items are each customer's favorites.

### DataFrame Creation | PySpark and Databricks
#### The restaurant has sales data that records each customer's orders, menu data with the details of each food item, and members data with the join date of every customer who has purchased a membership.

In [0]:
# We use the Spark session that Databricks provides out of the box, available as `spark`
spark

In [0]:
# we are creating a dataframe for sales data
sales_data = ([
    ('A', '2021-01-01', '1'), ('A', '2021-01-01', '2'), ('A', '2021-01-07', '2'),
    ('A', '2021-01-10', '3'), ('A', '2021-01-11', '3'), ('A', '2021-01-11', '3'),
    ('B', '2021-01-01', '2'), ('B', '2021-01-02', '2'), ('B', '2021-01-04', '1'),
    ('B', '2021-01-11', '1'), ('B', '2021-01-16', '3'), ('B', '2021-02-01', '3'),
    ('C', '2021-01-01', '3'), ('C', '2021-01-01', '1'), ('C', '2021-01-07', '3')
])
sales_cols = ["cust_id", "order_date", "prod_id"]
sales_df = spark.createDataFrame(data=sales_data, schema=sales_cols)
sales_df.display()

cust_id,order_date,prod_id
A,2021-01-01,1
A,2021-01-01,2
A,2021-01-07,2
A,2021-01-10,3
A,2021-01-11,3
A,2021-01-11,3
B,2021-01-01,2
B,2021-01-02,2
B,2021-01-04,1
B,2021-01-11,1


In [0]:
# Now we are creating dataframe for menu data
menu_data = ([
    ('1', 'palak_paneer', 100), ('2', 'chicken_tikka', 150), ('3', 'jeera_rice', 120),
    ('4', 'kheer', 110), ('5', 'vada_pav', 80), ('6', 'paneer_tikka', 180)
])
menu_cols = ["prod_id", "prod_name", "price"]
menu_df = spark.createDataFrame(data=menu_data, schema=menu_cols)
menu_df.display()

prod_id,prod_name,price
1,palak_paneer,100
2,chicken_tikka,150
3,jeera_rice,120
4,kheer,110
5,vada_pav,80
6,paneer_tikka,180


In [0]:
# Now we are creating dataframe for member data
mem_data = ([
    ('A', '2021-01-07'), ('B', '2021-01-09')
])
mem_cols = ["cust_id", "join_date"]
mem_df = spark.createDataFrame(data=mem_data, schema=mem_cols)
mem_df.display()

cust_id,join_date
A,2021-01-07
B,2021-01-09


In [0]:
# To run SQL against a DataFrame, we first register it as a temporary view
sales_df.createOrReplaceTempView("sales_tb")
spark.sql("select * from sales_tb").limit(5).display()
menu_df.createOrReplaceTempView("menu_tb")
spark.sql("select * from menu_tb").display()
mem_df.createOrReplaceTempView("mem_tb")
spark.sql("select * from mem_tb").display()

cust_id,order_date,prod_id
A,2021-01-01,1
A,2021-01-01,2
A,2021-01-07,2
A,2021-01-10,3
A,2021-01-11,3


prod_id,prod_name,price
1,palak_paneer,100
2,chicken_tikka,150
3,jeera_rice,120
4,kheer,110
5,vada_pav,80
6,paneer_tikka,180


cust_id,join_date
A,2021-01-07
B,2021-01-09


## Solving Problem Statements of Restaurants using PySpark
### We now have 3 DataFrames (sales_df, menu_df, and mem_df)
### and 3 temp views (sales_tb, menu_tb, and mem_tb).

### Question 01:- What is the total amount each customer spent at the restaurant?

In [0]:
# There are 3 customers: A, B, and C.
# We have to find the total amount spent by each customer.
result = spark.sql("""
                        select s.cust_id, sum(m.price) as amount_spent
                        from sales_tb s join menu_tb m
                        on s.prod_id = m.prod_id
                        group by s.cust_id
                        order by amount_spent desc
                """)

result.display()

cust_id,amount_spent
A,760
B,740
C,340


In [0]:
from pyspark.sql import functions as F
total_spent_df = sales_df.join(menu_df, "prod_id").groupBy("cust_id")\
                    .agg(F.sum("price").alias("total_spent_amounts")).orderBy("cust_id")

total_spent_df.display()

cust_id,total_spent_amounts
A,760
B,740
C,340
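
The totals above can be cross-checked without a Spark session. The following is a minimal sketch in plain Python, not part of the original notebook: the names `sales`, `prices`, and `totals` are illustrative assumptions, and `prices` lists only the three products that actually appear in the sales rows.

```python
from collections import defaultdict

# Same rows as sales_data: (cust_id, order_date, prod_id)
sales = [
    ("A", "2021-01-01", "1"), ("A", "2021-01-01", "2"), ("A", "2021-01-07", "2"),
    ("A", "2021-01-10", "3"), ("A", "2021-01-11", "3"), ("A", "2021-01-11", "3"),
    ("B", "2021-01-01", "2"), ("B", "2021-01-02", "2"), ("B", "2021-01-04", "1"),
    ("B", "2021-01-11", "1"), ("B", "2021-01-16", "3"), ("B", "2021-02-01", "3"),
    ("C", "2021-01-01", "3"), ("C", "2021-01-01", "1"), ("C", "2021-01-07", "3"),
]
# prod_id -> price, taken from menu_data (hypothetical lookup table)
prices = {"1": 100, "2": 150, "3": 120}

totals = defaultdict(int)
for cust_id, _order_date, prod_id in sales:
    # Looking up the price plays the role of the join on prod_id;
    # the running total plays the role of sum(price) grouped by cust_id.
    totals[cust_id] += prices[prod_id]

print(dict(totals))  # {'A': 760, 'B': 740, 'C': 340}
```

The dictionary lookup stands in for the inner join, so a `prod_id` missing from the menu would raise a `KeyError` here instead of silently dropping the row as the join would.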


### Question 02:- How many days has each customer visited the restaurant?

In [0]:
result = spark.sql("""
                        select cust_id, count(distinct order_date) as visited_days from sales_tb
                        group by cust_id
                        order by visited_days;
                """)
result.display()

cust_id,visited_days
C,2
A,4
B,6


In [0]:
from pyspark.sql.functions import countDistinct
# alias() names the aggregate directly; its default name is "count(DISTINCT order_date)"
visit_df = sales_df.groupBy("cust_id").agg(countDistinct("order_date").alias("visit_days"))\
    .orderBy("cust_id")
visit_df.display()

cust_id,visit_days
A,4
B,6
C,2


### Question 03:- What was each customer’s first item from the menu?

In [0]:

result = spark.sql("""
                        WITH first_item AS (
                                select s.cust_id, s.order_date, m.prod_name,
                                        dense_rank() OVER (PARTITION BY s.cust_id order by s.order_date) as order_rank
                                from sales_tb s
                                join menu_tb m
                                on s.prod_id = m.prod_id
                        )
                        select cust_id, prod_name
                        from first_item
                        where order_rank = 1;
                """)
result.display()

cust_id,prod_name
A,palak_paneer
A,chicken_tikka
B,chicken_tikka
C,palak_paneer
C,jeera_rice


In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
window_spec = Window.partitionBy("cust_id").orderBy("order_date")
# Rank first, filter on the rank, and only then project the output columns
items_purchased_df = sales_df.join(menu_df, "prod_id")\
        .withColumn("order_rank", dense_rank().over(window_spec))\
        .filter("order_rank == 1")\
        .select("cust_id", "prod_name")\
        .orderBy("cust_id")
    
items_purchased_df.display()

cust_id,prod_name
A,palak_paneer
A,chicken_tikka
B,chicken_tikka
C,palak_paneer
C,jeera_rice
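
Customers A and C each get two rows because `dense_rank` gives tied order dates the same rank. As a rough illustration of that behaviour, here is a plain-Python sketch. It is not from the original notebook: `dense_rank_rows` is a hypothetical helper, and `rows` is only a subset of the joined sales and menu data.

```python
from itertools import groupby

# (cust_id, order_date, prod_name): a subset of sales joined to the menu
rows = [
    ("A", "2021-01-01", "palak_paneer"), ("A", "2021-01-01", "chicken_tikka"),
    ("A", "2021-01-07", "chicken_tikka"),
    ("B", "2021-01-01", "chicken_tikka"), ("B", "2021-01-02", "chicken_tikka"),
    ("C", "2021-01-01", "jeera_rice"), ("C", "2021-01-01", "palak_paneer"),
    ("C", "2021-01-07", "jeera_rice"),
]

def dense_rank_rows(rows):
    """Yield (cust_id, order_date, prod_name, rank), ranking dates per customer."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _cust_id, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        # Distinct dates in ascending order: tied dates share a rank, with no gaps.
        rank_of = {d: i + 1 for i, d in enumerate(sorted({r[1] for r in group}))}
        for r in group:
            yield (*r, rank_of[r[1]])

first_items = [(c, p) for c, _d, p, rank in dense_rank_rows(rows) if rank == 1]
print(first_items)
```

A `row_number`-style ranking would instead keep exactly one row per customer, chosen arbitrarily among the items ordered on the same day, since the data has no timestamp to break the tie.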


### Question 04:- Find out the most purchased item from the menu and how many times the customers purchased it.

In [0]:
result = spark.sql("""
                        select prod_name, count(s.prod_id) as product_purchase 
                        from menu_tb m join sales_tb s on m.prod_id = s.prod_id
                        group by prod_name, s.prod_id
                        order by count(s.prod_id) desc
                        limit 1
                """)
result.display()

prod_name,product_purchase
jeera_rice,7


In [0]:
from pyspark.sql.functions import count
most_purchased_df = menu_df.join(sales_df, "prod_id")\
        .groupBy("prod_id", "prod_name")\
        .agg(count("prod_id").alias("product_count"))\
        .orderBy("product_count", ascending=False)\
        .drop("prod_id")\
        .limit(1)
most_purchased_df.display()

prod_name,product_count
jeera_rice,7
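
The count can be reproduced with `collections.Counter`. This is a small sketch in plain Python rather than the notebook's own method; `prod_ids` and `names` are illustrative names holding the `prod_id` column of the sales data and the menu lookup.

```python
from collections import Counter

# prod_id column of sales_data, in row order (customers A, B, then C)
prod_ids = ["1", "2", "2", "3", "3", "3",
            "2", "2", "1", "1", "3", "3",
            "3", "1", "3"]
# prod_id -> prod_name for the products that were actually sold
names = {"1": "palak_paneer", "2": "chicken_tikka", "3": "jeera_rice"}

counts = Counter(names[p] for p in prod_ids)
top_item, top_count = counts.most_common(1)[0]

print(top_item, top_count)  # jeera_rice 7
```

Here the top item is unambiguous, but `palak_paneer` and `chicken_tikka` are tied at 4 purchases each. If the tie were at the top, `limit 1` would return only one of the tied items, so a ranking function would be the safer choice in that case.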


### Question 05:- Which item was the most popular for each customer?

In [0]:
spark.sql("""
                with most_popular as (
                    select s.cust_id, m.prod_name, count(s.prod_id) as prod_count,
                    dense_rank() over (partition by s.cust_id order by count(s.prod_id) desc) as p_rank
                    from sales_tb s join menu_tb m 
                    on s.prod_id = m.prod_id
                    group by s.cust_id, m.prod_name
                )
                select cust_id, prod_name, prod_count
                from most_popular
                where p_rank = 1
        """).display()

cust_id,prod_name,prod_count
A,jeera_rice,3
B,chicken_tikka,2
B,palak_paneer,2
B,jeera_rice,2
C,jeera_rice,2


In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, dense_rank

sales_df.join(menu_df, "prod_id")\
    .groupBy("cust_id", "prod_name")\
    .agg(count("prod_id").alias("prod_count"))\
    .withColumn("p_rank", dense_rank().over(Window.partitionBy("cust_id").orderBy(col("prod_count").desc())))\
    .filter("p_rank = 1")\
    .drop("p_rank")\
    .display()

cust_id,prod_name,prod_count
A,jeera_rice,3
B,chicken_tikka,2
B,palak_paneer,2
B,jeera_rice,2
C,jeera_rice,2


### Question 06:- Which item was ordered first by the customer after becoming a restaurant member?

In [0]:
# First, list every item ordered strictly after the customer's join date (order_date > join_date).

spark.sql("""
                select s.cust_id, s.order_date, m.prod_name from sales_tb s
                join menu_tb m on s.prod_id = m.prod_id
                join mem_tb mm on s.cust_id = mm.cust_id
                where s.order_date > mm.join_date
        
        """).display()

cust_id,order_date,prod_name
A,2021-01-11,jeera_rice
A,2021-01-11,jeera_rice
A,2021-01-10,jeera_rice
B,2021-02-01,jeera_rice
B,2021-01-16,jeera_rice
B,2021-01-11,palak_paneer


In [0]:
# The question asks only for the first item ordered after joining, so we rank each member's orders by date.
# Note that >= also counts orders placed on the join date itself.
spark.sql("""
                with ordered_first as (
                    select s.cust_id, s.order_date, m.prod_name,
                    dense_rank() over (partition by s.cust_id order by s.order_date) as c_rank
                    from sales_tb s
                    join menu_tb m on s.prod_id = m.prod_id
                    join mem_tb mm on s.cust_id = mm.cust_id
                    where s.order_date >= mm.join_date
                )

                select cust_id, order_date, prod_name 
                from ordered_first
                where c_rank = 1
        
        """).display()

cust_id,order_date,prod_name
A,2021-01-07,chicken_tikka
B,2021-01-11,palak_paneer


In [0]:
sales_df.join(mem_df, "cust_id")\
    .filter(sales_df.order_date >= mem_df.join_date)\
    .withColumn("dense_rank", dense_rank().over(Window.partitionBy("cust_id").orderBy("order_date")))\
    .filter("dense_rank = 1")\
    .join(menu_df, "prod_id")\
    .select("cust_id", "order_date", "prod_name").orderBy("cust_id")\
    .display()

cust_id,order_date,prod_name
A,2021-01-07,chicken_tikka
B,2021-01-11,palak_paneer
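
Both `order_date` and `join_date` are stored as strings, so the filters above compare text, not dates. That works only because every value is in zero-padded `YYYY-MM-DD` form. The sketch below, in plain Python with illustrative variable names, checks this against real date objects and shows a non-padded format where string ordering goes wrong.

```python
from datetime import date

join_date = "2021-01-07"  # customer A's join date
order_dates = ["2021-01-01", "2021-01-07", "2021-01-10", "2021-01-11"]

# Lexicographic comparison on the raw strings
as_strings = [d >= join_date for d in order_dates]
# The same comparison on parsed dates
as_dates = [date.fromisoformat(d) >= date.fromisoformat(join_date)
            for d in order_dates]

print(as_strings == as_dates)  # True: both give [False, True, True, True]

# Without zero padding, string order no longer matches calendar order:
print("2021-2-01" < "2021-10-01")              # False, because "2" sorts after "1"
print(date(2021, 2, 1) < date(2021, 10, 1))    # True
```

If the source data could ever arrive in another format, casting the columns to a date type before comparing would remove this dependency.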


### Question 07:- Which item was purchased before the customer became a member?

In [0]:
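# Same query as before, with the filter flipped to order_date < join_date.
# The ascending rank returns each member's earliest purchase before joining;
# order by s.order_date desc instead to get the last purchase before joining.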
spark.sql("""
                with ordered_first as (
                    select s.cust_id, s.order_date, m.prod_name,
                    dense_rank() over (partition by s.cust_id order by s.order_date) as c_rank
                    from sales_tb s
                    join menu_tb m on s.prod_id = m.prod_id
                    join mem_tb mm on s.cust_id = mm.cust_id
                    where s.order_date < mm.join_date
                )

                select cust_id, order_date, prod_name 
                from ordered_first
                where c_rank = 1
        
        """).display()

cust_id,order_date,prod_name
A,2021-01-01,chicken_tikka
A,2021-01-01,palak_paneer
B,2021-01-01,chicken_tikka
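
If the question is read as "just before joining", the last pre-membership order date is the one of interest, not the first. The following plain-Python sketch explores that alternative reading. It is an assumption about the intent, not the notebook's answer, and `last_before` is an illustrative name.

```python
# (cust_id, order_date, prod_name) for the two members, from sales joined to menu
sales = [
    ("A", "2021-01-01", "palak_paneer"), ("A", "2021-01-01", "chicken_tikka"),
    ("A", "2021-01-07", "chicken_tikka"), ("A", "2021-01-10", "jeera_rice"),
    ("A", "2021-01-11", "jeera_rice"), ("A", "2021-01-11", "jeera_rice"),
    ("B", "2021-01-01", "chicken_tikka"), ("B", "2021-01-02", "chicken_tikka"),
    ("B", "2021-01-04", "palak_paneer"), ("B", "2021-01-11", "palak_paneer"),
    ("B", "2021-01-16", "jeera_rice"), ("B", "2021-02-01", "jeera_rice"),
]
join_dates = {"A": "2021-01-07", "B": "2021-01-09"}

last_before = {}
for cust_id, join_date in join_dates.items():
    # Keep only this customer's orders placed strictly before joining
    before = [(d, p) for c, d, p in sales if c == cust_id and d < join_date]
    last_day = max(d for d, _p in before)
    # Every item ordered on that last day ties for "just before joining"
    last_before[cust_id] = (last_day, sorted(p for d, p in before if d == last_day))

print(last_before)
```

Under this reading, customer A's answer is unchanged (both pre-membership items were bought on 2021-01-01), while customer B's becomes `palak_paneer` on 2021-01-04.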


### Question 08:- What are the total items and amount spent for each member before they became a member?

In [0]:
spark.sql("""
                select s.cust_id, count(distinct s.prod_id) as item_count, sum(m.price) as total_amount
                from sales_tb s join menu_tb m on s.prod_id = m.prod_id
                join mem_tb mm on s.cust_id = mm.cust_id
                where s.order_date < mm.join_date
                group by s.cust_id
                order by total_amount desc
""").display()

cust_id,item_count,total_amount
B,2,400
A,2,250


In [0]:
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum in this notebook
from pyspark.sql.functions import countDistinct, sum

sales_df.join(menu_df, "prod_id").join(mem_df, "cust_id")\
        .filter(sales_df.order_date < mem_df.join_date)\
        .groupBy("cust_id")\
        .agg(countDistinct("prod_id").alias("total_item"), sum("price").alias("total_amount"))\
        .orderBy("total_amount", ascending=False)\
        .display()

cust_id,total_item,total_amount
B,2,400
A,2,250
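
"Total items" can mean either distinct menu items or the number of items ordered, and the queries above count the former. The plain-Python sketch below computes both for comparison. The `summary` structure and its key names are illustrative assumptions, not part of the original notebook.

```python
# Same rows as sales_data: (cust_id, order_date, prod_id)
sales = [
    ("A", "2021-01-01", "1"), ("A", "2021-01-01", "2"), ("A", "2021-01-07", "2"),
    ("A", "2021-01-10", "3"), ("A", "2021-01-11", "3"), ("A", "2021-01-11", "3"),
    ("B", "2021-01-01", "2"), ("B", "2021-01-02", "2"), ("B", "2021-01-04", "1"),
    ("B", "2021-01-11", "1"), ("B", "2021-01-16", "3"), ("B", "2021-02-01", "3"),
    ("C", "2021-01-01", "3"), ("C", "2021-01-01", "1"), ("C", "2021-01-07", "3"),
]
prices = {"1": 100, "2": 150, "3": 120}
join_dates = {"A": "2021-01-07", "B": "2021-01-09"}  # customer C never joined

summary = {}
for cust_id, join_date in join_dates.items():
    # prod_ids this customer ordered strictly before joining
    before = [p for c, d, p in sales if c == cust_id and d < join_date]
    summary[cust_id] = {
        "distinct_items": len(set(before)),   # what count(distinct prod_id) returns
        "orders": len(before),                # what count(prod_id) would return
        "total_amount": sum(prices[p] for p in before),
    }

print(summary)
```

For customer B the two counts differ: 3 items were ordered before joining, but only 2 distinct products, because `chicken_tikka` was ordered twice. The amount of 400 includes both purchases either way.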


### Question 09:- If each rupee spent equates to 10 points and item ‘jeera_rice’ has a 2x points multiplier, find out how many points each customer would have.

In [0]:
spark.sql("""
                select s.cust_id, sum(m.price * 10 * if(m.prod_name = "jeera_rice", 2, 1)) as total_point
                from sales_tb s join menu_tb m on s.prod_id = m.prod_id
                group by s.cust_id
                order by total_point desc
""").display()

cust_id,total_point
A,11200
B,9800
C,5800


In [0]:
from pyspark.sql.functions import when, col

sales_df.join(menu_df, "prod_id")\
    .withColumn("total_point", when(col("prod_name") == "jeera_rice", col("price")*20)\
                                .otherwise(col("price")*10))\
    .groupBy("cust_id")\
    .agg(sum("total_point").alias("reward_points"))\
    .orderBy("reward_points", ascending=False)\
    .display()

cust_id,reward_points
A,11200
B,9800
C,5800
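
The points rule is simple enough to verify by hand: `price * 10`, doubled for `jeera_rice`. Here is a minimal plain-Python sketch of that arithmetic, where `points_for` is a hypothetical helper introduced only for this check.

```python
from collections import defaultdict

# Same rows as sales_data: (cust_id, order_date, prod_id)
sales = [
    ("A", "2021-01-01", "1"), ("A", "2021-01-01", "2"), ("A", "2021-01-07", "2"),
    ("A", "2021-01-10", "3"), ("A", "2021-01-11", "3"), ("A", "2021-01-11", "3"),
    ("B", "2021-01-01", "2"), ("B", "2021-01-02", "2"), ("B", "2021-01-04", "1"),
    ("B", "2021-01-11", "1"), ("B", "2021-01-16", "3"), ("B", "2021-02-01", "3"),
    ("C", "2021-01-01", "3"), ("C", "2021-01-01", "1"), ("C", "2021-01-07", "3"),
]
# prod_id -> (prod_name, price) for the products that were sold
menu = {"1": ("palak_paneer", 100), "2": ("chicken_tikka", 150), "3": ("jeera_rice", 120)}

def points_for(prod_name, price):
    """10 points per rupee, with a 2x multiplier for jeera_rice."""
    multiplier = 2 if prod_name == "jeera_rice" else 1
    return price * 10 * multiplier

points = defaultdict(int)
for cust_id, _order_date, prod_id in sales:
    name, price = menu[prod_id]
    points[cust_id] += points_for(name, price)

print(dict(points))  # {'A': 11200, 'B': 9800, 'C': 5800}
```

For example, customer A earns 1000 for `palak_paneer`, 2 x 1500 for `chicken_tikka`, and 3 x 2400 for `jeera_rice`, which adds up to 11200.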


### Question 10:- Create the complete table with all data and columns like customer_id, order_date, product_name, price, and member(Y/N).

In [0]:
# We use left joins to keep every sales row, including customers who have no membership;
# the CASE expression then decides whether the customer was a member on the order date.

spark.sql("""
                select s.cust_id, s.order_date, m.prod_name, m.price, 
                    case
                        when s.order_date >= mm.join_date then "Y"
                        when s.order_date < mm.join_date then "N"
                        else "N"
                    end as member
                from sales_tb s left join menu_tb m on s.prod_id = m.prod_id
                left join mem_tb mm on s.cust_id = mm.cust_id 
""").display()

cust_id,order_date,prod_name,price,member
A,2021-01-01,palak_paneer,100,N
A,2021-01-01,chicken_tikka,150,N
A,2021-01-07,chicken_tikka,150,Y
A,2021-01-10,jeera_rice,120,Y
A,2021-01-11,jeera_rice,120,Y
A,2021-01-11,jeera_rice,120,Y
B,2021-01-01,chicken_tikka,150,N
B,2021-01-02,chicken_tikka,150,N
B,2021-01-04,palak_paneer,100,N
B,2021-01-11,palak_paneer,100,Y


In [0]:
sales_df.join(menu_df, "prod_id", "left").join(mem_df, "cust_id","left")\
        .withColumn("member", when(col("order_date")<col("join_date"), "N")\
                    .when(col("order_date")>=col("join_date"), "Y")\
                        .otherwise("N"))\
        .drop("prod_id", "join_date")\
        .display()
        

cust_id,order_date,prod_name,price,member
A,2021-01-01,palak_paneer,100,N
A,2021-01-01,chicken_tikka,150,N
A,2021-01-07,chicken_tikka,150,Y
A,2021-01-10,jeera_rice,120,Y
A,2021-01-11,jeera_rice,120,Y
A,2021-01-11,jeera_rice,120,Y
B,2021-01-01,chicken_tikka,150,N
B,2021-01-04,palak_paneer,100,N
B,2021-01-02,chicken_tikka,150,N
B,2021-01-16,jeera_rice,120,Y
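
Customer C has no row in the members data, so after the left join their `join_date` is null. Both comparisons then evaluate to null and the row falls through to the final "N" branch. The plain-Python sketch below mirrors that logic, with `None` standing in for the null; `member_flag` is a hypothetical helper, not a Spark function.

```python
from typing import Optional

def member_flag(order_date: str, join_date: Optional[str]) -> str:
    """Return "Y" if the order was placed on or after the join date, else "N"."""
    if join_date is None:
        # No membership row matched in the left join (customer C's case)
        return "N"
    # ISO-formatted date strings compare correctly as plain strings
    return "Y" if order_date >= join_date else "N"

print(member_flag("2021-01-01", "2021-01-07"))  # N: ordered before joining
print(member_flag("2021-01-07", "2021-01-07"))  # Y: ordered on the join date
print(member_flag("2021-01-07", None))          # N: never joined
```

Spelling out the null case makes it clear why the explicit `order_date < join_date` branch is not strictly needed: every row that is not a "Y" ends up as "N" anyway.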
