# Writing Window Functions

Write Spark code that performs the steps listed below, and optimize it
for execution speed.

#### 1. Filter the complete dataset X for DISH = (Biryani or Pizza or Dosa)
#### 2. Group by ORDERID, STOREID, PRICE
#### 3. If the number of records in the group is > 1
##### 3.1 Calculate the number of records per dish
##### 3.2 Calculate the total number of records
###### In case you can't find such a record, drop all these records
##### 3.3 Update the column JUSTANOTHERFEATURE with a list of distinct values which are greater than 1 in that column for the group
#### 4. Else do nothing
#### 5. Merge it back with X and write the output to an S3 path partitioned by the DISH name
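
The steps above leave some room for interpretation, so a small plain-Python model can help pin down the expected result before writing Spark code. The sketch below is not the Spark solution. It assumes rows are dicts keyed by the column names in the problem statement, reads step 3.3 as "distinct values numerically greater than 1" (an assumption), and does not model the drop rule under 3.2 or the merge in step 5. The function name `process`, the added keys `records_per_dish` and `total_records`, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

DISHES = {"Biryani", "Pizza", "Dosa"}
KEY = ("ORDERID", "STOREID", "PRICE")


def process(rows):
    """Plain-Python model of steps 1-4 for a small list of dict rows."""
    # Step 1: keep only the three dishes of interest.
    kept = [r for r in rows if r["DISH"] in DISHES]

    # Step 2: group by (ORDERID, STOREID, PRICE).
    groups = defaultdict(list)
    for r in kept:
        groups[tuple(r[k] for k in KEY)].append(r)

    out = []
    for members in groups.values():
        if len(members) > 1:
            # Steps 3.1 and 3.2: per-dish and total record counts.
            per_dish = Counter(m["DISH"] for m in members)
            # Step 3.3 (assumed reading): distinct feature values above 1.
            feature = sorted({m["JUSTANOTHERFEATURE"] for m in members
                              if m["JUSTANOTHERFEATURE"] > 1})
            for m in members:
                out.append({**m,
                            "records_per_dish": per_dish[m["DISH"]],
                            "total_records": len(members),
                            "JUSTANOTHERFEATURE": feature})
        else:
            # Step 4: single-record groups pass through unchanged.
            out.extend(members)
    return out


rows = [
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Pizza", "JUSTANOTHERFEATURE": 3},
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Dosa", "JUSTANOTHERFEATURE": 1},
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Pizza", "JUSTANOTHERFEATURE": 2},
    {"ORDERID": 2, "STOREID": 10, "PRICE": 7.0, "DISH": "Biryani", "JUSTANOTHERFEATURE": 4},
    {"ORDERID": 3, "STOREID": 11, "PRICE": 2.0, "DISH": "Burger", "JUSTANOTHERFEATURE": 9},
]
result = process(rows)
```

In this sample the three-record order gets `records_per_dish` of 2 for Pizza and 1 for Dosa, `total_records` of 3, and a feature list of `[2, 3]`. The single Biryani row passes through unchanged, and the Burger row is removed by the step 1 filter.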

If you make any assumptions about the problem statement while writing the code, state them.


### The dataset is available at the link below
https://drive.google.com/file/d/1JbO0JS9UKZdqhwLrTRVYfivTLycfOrpC/

In [145]:
# Show the notebook's working directory.
!pwd

/home/rita/Documents/Spark/Assignments


In [3]:
import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

from pyspark.sql import SparkSession, Window
# Import only the functions needed here; a wildcard import would shadow
# Python built-ins such as sum.
from pyspark.sql.functions import col, collect_set, count

In [4]:
spark = SparkSession.builder.appName("Window Function Assignment").master("local[*]").getOrCreate()
# Parquet files carry their own schema, so CSV options such as header,
# inferSchema and delimiter do not apply here.
data = spark.read.parquet("/home/rita/Documents/Spark/Assignments/spark_assignment_data.snappy.parquet")

In [5]:
#data.toPandas().to_csv('data.csv')

In [6]:
print((data.count(), len(data.columns)))

(1129292, 7)


### Filter the complete dataset for DISH = (Biryani or Pizza or Dosa) from X, where X is the complete dataset

In [7]:
dishes = ['Biryani','Pizza','Dosa']
filtered_df = data.filter(data.DISH.isin(dishes))
#filtered_df.toPandas().to_csv('filtered.csv')

In [8]:
print((filtered_df.count(), len(filtered_df.columns)))

(805280, 7)


In [9]:
# Row count per dish after filtering.
count_dishes = filtered_df.groupBy('DISH').agg(count('DISH'))
count_dishes.show()

+-------+-----------+
|   DISH|count(DISH)|
+-------+-----------+
|   Dosa|       4938|
|Biryani|        585|
|  Pizza|     799757|
+-------+-----------+



In [11]:
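# Caution: with orderBy and no explicit frame, the window's default frame ends at the
# current row and its peers, so count("*") below is a running count by DISH rather than
# the size of the whole (ORDERID, STOREID, PRICE) group. Dropping orderBy gives the group size.
partition_ = Window.partitionBy("ORDERID","STOREID","PRICE").orderBy("DISH")

grouped_df = filtered_df.withColumn("records", count("*").over(partition_))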
#grouped_df.toPandas().to_csv('grouped.csv')
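
To see why the ordering on the window matters for `count("*")`, here is a plain-Python model of Spark's default window frames. It is a sketch of the semantics only, not Spark code: the helper `window_count` and the sample rows are hypothetical. Without an ordering the frame is the whole partition; with an ordering and no explicit frame, it runs from the start of the partition through the current row and any rows that share its order value.

```python
from collections import defaultdict


def window_count(rows, partition_keys, order_key=None):
    """Model count(*) over a window using Spark's default frames."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[tuple(r[k] for k in partition_keys)].append(r)

    counts = []
    for r in rows:
        members = partitions[tuple(r[k] for k in partition_keys)]
        if order_key is None:
            # Unordered window: every row sees the full partition.
            counts.append(len(members))
        else:
            # Ordered window: rows up to and including the current row's peers.
            counts.append(sum(1 for m in members if m[order_key] <= r[order_key]))
    return counts


rows = [
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Biryani"},
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Pizza"},
    {"ORDERID": 1, "STOREID": 10, "PRICE": 5.0, "DISH": "Pizza"},
]
keys = ("ORDERID", "STOREID", "PRICE")
unordered = window_count(rows, keys)                  # [3, 3, 3]
ordered = window_count(rows, keys, order_key="DISH")  # [1, 3, 3]
```

In the ordered case the Biryani row gets a count of 1 even though its group holds three records, so a filter such as `records > 1` would drop it.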

In [12]:
count_dishes = grouped_df.groupBy('DISH').agg(count('DISH'))
count_dishes.show()

+-------+-----------+
|   DISH|count(DISH)|
+-------+-----------+
|   Dosa|       4938|
|Biryani|        585|
|  Pizza|     799757|
+-------+-----------+



In [13]:
records_greater_than_1 = grouped_df.filter(grouped_df.records > 1)
#records_greater_than_1.toPandas().to_csv('records_greater_than_1.csv')

In [15]:
count_dishes = records_greater_than_1.groupBy('DISH').agg(count('DISH'))
count_dishes.show()

+-----+-----------+
| DISH|count(DISH)|
+-----+-----------+
| Dosa|        403|
|Pizza|      13229|
+-----+-----------+



In [16]:
no_of_records = records_greater_than_1.agg(count("*"))
no_of_records.show()

+--------+
|count(1)|
+--------+
|   13632|
+--------+



In [17]:
# Jars downloaded for S3 access (shell commands, run outside the notebook):
# sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
# sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
# sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
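
The jars above are pinned independently of each other and of the Hadoop version bundled with Spark, and Spark only sees them if they end up on its classpath. A common alternative, sketched here under assumptions, is to let Spark resolve the S3A connector through `spark.jars.packages`, which also pulls in the AWS SDK that `hadoop-aws` declares as a dependency. The helper name `s3a_packages` is hypothetical, and the version `3.3.4` is only an example; it should be replaced with the Hadoop version the local Spark build actually uses.

```python
def s3a_packages(hadoop_version):
    """Return a spark.jars.packages value for the S3A connector.

    The hadoop-aws version should match the Hadoop version Spark runs on.
    """
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


packages = s3a_packages("3.3.4")  # example version, an assumption

# Usage sketch (needs pyspark, and must run before any session is created):
# spark = (SparkSession.builder
#          .config("spark.jars.packages", packages)
#          .getOrCreate())
```

The setting only takes effect when the session is first created, so it belongs in the cell that builds `spark`, not in a later one.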

In [18]:
import os

# Read credentials from the environment rather than hard-coding secrets in the notebook.
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoop_conf.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
hadoop_conf.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.endpoint", "")

In [19]:
# Local variant of the write, kept for reference:
# try:
#     records_greater_than_1.coalesce(1) \
#         .write.option("header",True) \
#         .option("maxRecordsPerFile", 5000) \
#         .partitionBy("DISH") \
#         .mode("overwrite") \
#         .format("json") \
#         .save("output_")
# except Exception as e:
#     print(e)
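
For readers unfamiliar with what `partitionBy("DISH")` produces on disk, the following plain-Python sketch models the directory layout: one `DISH=<value>` directory per distinct value, with the partition column removed from the rows stored inside it. It models the layout only and writes no files; the helper `partition_dirs` and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition_dirs(rows, base_path, column):
    """Group rows the way DataFrameWriter.partitionBy lays them out."""
    layout = defaultdict(list)
    for r in rows:
        # Each distinct value maps to a `column=value` directory.
        directory = f"{base_path}/{column}={r[column]}"
        # The partition column lives in the path, not in the stored rows.
        layout[directory].append({k: v for k, v in r.items() if k != column})
    return dict(layout)


rows = [
    {"ORDERID": 1, "DISH": "Pizza"},
    {"ORDERID": 2, "DISH": "Dosa"},
    {"ORDERID": 3, "DISH": "Pizza"},
]
layout = partition_dirs(rows, "output_", "DISH")
```

Here `layout` has the keys `output_/DISH=Pizza` and `output_/DISH=Dosa`, which is the shape a reader of the output should expect when filtering by dish.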

In [20]:
try:
    (records_greater_than_1.coalesce(1)  # a single partition means a single write task
        .write
        .option("maxRecordsPerFile", 5000)
        .partitionBy("DISH")
        .mode("overwrite")
        .format("json")
        .save("s3a://giruu/output_"))
except Exception as e:
    print(e)

An error occurred while calling o179.save.
: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2604)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2569)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.exec