# Writing Window Functions

Write Spark code to perform the steps below, and optimize it for
execution speed.

1. Filter the complete dataset X for DISH = Biryani, Pizza, or Dosa.
2. Group by ORDERID, STOREID, PRICE.
3. If the number of records in the group is > 1:
   1. If the group has only records with DISH = Biryani, keep just one record, updated as follows:
      - TOTALAMOUNT = SUM(TOTALAMOUNT)
      - QUANTITY = SUM(QUANTITY)
      - JUSTANOTHERFEATURE = SUM(JUSTANOTHERFEATURE)
   2. If the group consists of records with DISH = Biryani and records with DISH = Pizza or Dosa:
      - Compute SUM(TOTALAMOUNT).
      - If SUM(TOTALAMOUNT) ≤ 0, drop all these records.
      - If SUM(TOTALAMOUNT) > 0, find and keep one record with a positive TOTALAMOUNT and DISH = Pizza or Dosa, and update it as follows:
         - TOTALAMOUNT = SUM(TOTALAMOUNT)
         - QUANTITY = SUM(QUANTITY)
         - JUSTANOTHERFEATURE = SUM(JUSTANOTHERFEATURE)
         - If SUM(QUANTITY) = 0:
            - If TOTALAMOUNT > 0: QUANTITY = 10 and JUSTANOTHERFEATURE = 20
            - If TOTALAMOUNT < 0: QUANTITY = -20 and JUSTANOTHERFEATURE = -10
      - If no such record can be found, drop all these records.
   3. Update the column JUSTANOTHERFEATURE with a list of the distinct values
      greater than 1 in that column for the group.

   Else, do nothing.
4. Merge the result back with X and write the output to an S3 path partitioned by
   the DISH name.

If you make any assumptions about the problem statement while writing the
code, please state them.
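
Before moving to Spark, it can help to pin down the per-group rules in plain Python, so that a Spark implementation can later be checked against it on small inputs. The following is only a sketch of the Biryani-only rule and the mixed Biryani plus Pizza/Dosa rule, not the Spark solution itself. The names `collapse_group` and `SUM_COLUMNS` are hypothetical, and three assumptions are made: any record may serve as the surviving Biryani record (the first one is used), groups that contain no Biryani are left unchanged, and the "list of distinct values" rule is left out because the statement does not say how it combines with the summed value.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]
SUM_COLUMNS = ("TOTALAMOUNT", "QUANTITY", "JUSTANOTHERFEATURE")


def collapse_group(records: List[Record]) -> List[Record]:
    """Apply the group rules to the records of one (ORDERID, STOREID, PRICE) group."""
    if len(records) <= 1:
        return list(records)  # single-record groups: do nothing

    dishes = {r["DISH"] for r in records}
    totals = {c: sum(r[c] for r in records) for c in SUM_COLUMNS}

    if dishes == {"Biryani"}:
        # Assumption: any record can be the survivor, so take the first one.
        return [{**records[0], **totals}]

    if "Biryani" in dishes:
        if totals["TOTALAMOUNT"] <= 0:
            return []  # drop the whole group
        keeper = next(
            (r for r in records
             if r["DISH"] in ("Pizza", "Dosa") and r["TOTALAMOUNT"] > 0),
            None,
        )
        if keeper is None:
            return []  # no positive Pizza/Dosa record: drop the whole group
        merged = {**keeper, **totals}
        if totals["QUANTITY"] == 0:
            # Assumption: TOTALAMOUNT here means the updated (summed) value.
            if merged["TOTALAMOUNT"] > 0:
                merged.update(QUANTITY=10, JUSTANOTHERFEATURE=20)
            elif merged["TOTALAMOUNT"] < 0:
                # Kept for fidelity with the statement; not reachable in this
                # branch because the summed TOTALAMOUNT is known to be > 0.
                merged.update(QUANTITY=-20, JUSTANOTHERFEATURE=-10)
        return [merged]

    # Assumption: groups without any Biryani record are left untouched.
    return list(records)
```

Calling `collapse_group` on a list of row dictionaries returns the rows that survive for that group, which gives a small oracle to compare a Spark result against.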


### The dataset is available at the following link
https://drive.google.com/file/d/1JbO0JS9UKZdqhwLrTRVYfivTLycfOrpC/

In [1]:
#!/usr/bin/env bash
!pwd

/home/rita/Documents/Spark/Assignments


In [2]:
from pyspark.sql import SparkSession
# Import only what the notebook uses; note that `sum` here shadows Python's built-in sum.
from pyspark.sql.functions import col, count, sum, when
from datetime import datetime
import time

In [3]:
spark = SparkSession.builder.appName("Window Function Assignment").master("local[*]").getOrCreate()

22/01/25 15:44:47 WARN Utils: Your hostname, EMPID21092 resolves to a loopback address: 127.0.1.1; using 192.168.1.6 instead (on interface wlp3s0)
22/01/25 15:44:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/01/25 15:44:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/25 15:44:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [4]:
# Parquet files carry their own schema, so the CSV-style options (header, inferschema, delimiter) are not needed.
df = spark.read.parquet("/home/rita/Documents/Spark/Assignments/spark_assignment_data.snappy.parquet")


In [5]:
df.show()

+-----+--------+-------+-----------------+------------------+--------+------------------+
| DISH| ORDERID|STOREID|            PRICE|       TOTALAMOUNT|QUANTITY|JUSTANOTHERFEATURE|
+-----+--------+-------+-----------------+------------------+--------+------------------+
|Pizza|ggeabhic|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza|ccfcbhid|  fihdf|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza| diebhid|  hbaeh|67.49000000000001|            39.812|     2.0|               2.0|
|Pizza|gdigbhid|  fihdf|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza|cefdbhid|  eihde|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza|bdcjbhid|  hfaeh|            44.39|            35.298|     3.0|               3.0|
|Pizza|hgbfbhid|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza|edjebhid|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza| de

In [7]:
print((df.count(), len(df.columns)))

(1129292, 7)


### Filter the complete dataset for DISH = (Biryani or Pizza or Dosa) from X, where X is the complete dataset

In [9]:
dishes = ['Biryani','Pizza','Dosa']
filtered_df = df.filter(df.DISH.isin(dishes))
filtered_df.show()

+-----+--------+-------+-----------------+------------------+--------+------------------+
| DISH| ORDERID|STOREID|            PRICE|       TOTALAMOUNT|QUANTITY|JUSTANOTHERFEATURE|
+-----+--------+-------+-----------------+------------------+--------+------------------+
|Pizza|ggeabhic|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza|ccfcbhid|  fihdf|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza| diebhid|  hbaeh|67.49000000000001|            39.812|     2.0|               2.0|
|Pizza|gdigbhid|  fihdf|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza|cefdbhid|  eihde|            44.39|11.766000000000002|     1.0|               1.0|
|Pizza|bdcjbhid|  hfaeh|            44.39|            35.298|     3.0|               3.0|
|Pizza|hgbfbhid|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza|edjebhid|  fihdf|            44.39|23.532000000000004|     2.0|               2.0|
|Pizza| de

In [10]:
print((filtered_df.count(), len(filtered_df.columns)))

(805280, 7)


In [29]:
# Count the records in each (ORDERID, STOREID, PRICE) group.
grouped_filtered_data = filtered_df.groupBy('ORDERID', 'STOREID', 'PRICE').agg(count('*').alias("records"))
grouped_filtered_data.show()



+--------+---------+------------------+-------+
| ORDERID|  STOREID|             PRICE|records|
+--------+---------+------------------+-------+
|dfebbhid|    dihdd|             44.39|      1|
|eabhbhid|    bgaeb| 67.49000000000001|      1|
| djebhid|    hfaeh|             44.39|      1|
|dahibhid|    achha|            140.99|      1|
|jjijbhid|    gbigg|             69.59|      1|
|cghebhid|   abiaha|            130.49|      1|
|cbhebhid|   afijia|262.78999999999996|      1|
|gjjhbhid|   ccjcbc|             31.79|      1|
|icijbhid|   bgadjb|             56.99|      1|
|jddbbhid|   fjddif|             52.79|      1|
| bfebhid|   bgadjb|             56.99|      1|
|iacebhid|   dcejjd|             56.99|      1|
|dgaibhid|jcjiihcbj|             46.49|      1|
|fgcebhid|fgdbgdiff|            199.79|      1|
|  dgbhid|abgjhfcba|             73.79|      1|
|cdgabhid|fgcdfbcbf|             52.79|      1|
|ghgjbhid|abffgiifa|             42.29|      1|
|ibjbbhid|   eeegce|             94.79| 

                                                                                

In [32]:
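from pyspark.sql import Window
# grouped_filtered_data only has the key columns and `records`, so sum TOTALAMOUNT per group on filtered_df with a window.
group_window = Window.partitionBy('ORDERID', 'STOREID', 'PRICE')
df_1 = filtered_df.withColumn("records", count('*').over(group_window)).withColumn("count", when(col("records") > 1, sum('TOTALAMOUNT').over(group_window)))
df_1.show()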