# Accenture (Hard Level) - PySpark Interview Question

You are given a dataset of monthly sales for several stores. Each row contains the store name, the month, and the sales amount. Your task is to calculate the cumulative (running) sales for each store, month by month, using PySpark.

You should also:

- Filter out stores with sales lower than 1000 in any month.
- Calculate the total sales for each store over all months.
- Sort the results by the total sales in descending order.


In [0]:
# Import only the names used below instead of wildcard imports.
from pyspark.sql.functions import col, sum  # note: this sum shadows Python's built-in sum
from pyspark.sql.window import Window

In [0]:
data = [
    ("Store A", "2024-01", 800),
    ("Store A", "2024-02", 1200),
    ("Store A", "2024-03", 900),
    ("Store B", "2024-01", 1500),
    ("Store B", "2024-02", 1600),
    ("Store B", "2024-03", 1400),
    ("Store C", "2024-01", 700),
    ("Store C", "2024-02", 1000),
    ("Store C", "2024-03", 800),
]

df = spark.createDataFrame(data, ["Store", "Month", "Sales"])

In [0]:
df.display()

Store,Month,Sales
Store A,2024-01,800
Store A,2024-02,1200
Store A,2024-03,900
Store B,2024-01,1500
Store B,2024-02,1600
Store B,2024-03,1400
Store C,2024-01,700
Store C,2024-02,1000
Store C,2024-03,800
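
With only nine rows, the expected result can be worked out by hand before running anything in Spark. The plain-Python trace below is a rough cross-check, not part of the Spark solution. The helper name `summarize` is illustrative, and it assumes the 1000 threshold is applied to each monthly row, which is only one possible reading of the filter requirement.

```python
from collections import defaultdict
from itertools import accumulate

# Same sample rows as the DataFrame: (store, month, sales).
rows = [
    ("Store A", "2024-01", 800), ("Store A", "2024-02", 1200), ("Store A", "2024-03", 900),
    ("Store B", "2024-01", 1500), ("Store B", "2024-02", 1600), ("Store B", "2024-03", 1400),
    ("Store C", "2024-01", 700), ("Store C", "2024-02", 1000), ("Store C", "2024-03", 800),
]

def summarize(rows, threshold=1000):
    """Return (store, month, cumulative_sales, total_sales) tuples,
    sorted by total_sales descending, then by store and month."""
    # Assumption: drop individual monthly rows below the threshold.
    by_store = defaultdict(list)
    for store, month, sales in sorted(rows):  # "YYYY-MM" strings sort chronologically
        if sales >= threshold:
            by_store[store].append((month, sales))

    result = []
    for store, months in by_store.items():
        running = list(accumulate(s for _, s in months))
        total = running[-1]  # the last running total equals the store total
        for (month, _), cum in zip(months, running):
            result.append((store, month, cum, total))
    return sorted(result, key=lambda r: (-r[3], r[0], r[1]))

for row in summarize(rows):
    print(row)
```

Under that assumption, Store B keeps all three months (running totals 1500, 3100, 4500), while Store A and Store C keep a single month each (1200 and 1000).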


In [0]:
# Running total per store, ordered by month.
window_criteria = Window.partitionBy('Store').orderBy(col('Month').asc())

# Keep only monthly rows with sales of at least 1000, then add the running total.
cum_sales = (
    df.filter(col('Sales') >= 1000)
      .withColumn('total_cum_sales', sum('Sales').over(window_criteria))
)

# Total sales per store over the rows that survived the filter.
df_total_sales = (
    cum_sales.groupBy('Store')
             .agg(sum(col('Sales')).alias('total_sales'))
)

# Joining on the column name keeps a single Store column and avoids
# comparing two references to the same underlying column.
(
    cum_sales.join(df_total_sales, on='Store', how='inner')
             .select('Store', 'total_cum_sales', 'total_sales')
             .orderBy(col('total_sales').desc())
             .display()
)

Store,total_cum_sales,total_sales
Store B,4500,4500
Store B,3100,4500
Store B,1500,4500
Store A,1200,1200
Store C,1000,1000
