## Task 4 
**Top 2 highest-earning sellers by location.**<br>
> We have two tables: the sales history and the seller data. Each seller is assigned to one location. Our task is to find the two sellers with the highest revenue in each location. Because we have both state and city information, we find the top sellers grouped by each of those two categories. We assume that revenue equals the product price minus the freight value that the seller has to settle. At the end we will create a map of the whole country with markers scaled by the number of sales made and the revenue gained.

In [15]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

In [16]:
# Orders data
# "order_id","order_item_id","product_id","seller_id","shipping_limit_date","price","freight_value"
olist_order_items_dataset = spark.read.options(header='True', inferSchema='True', delimiter=',') \
                            .csv("data/olist_order_items_dataset.csv")
olist_order_items_dataset.show(2)
# Sellers data 
# "seller_id","seller_zip_code_prefix","seller_city","seller_state"
olist_sellers_dataset = spark.read.options(header='True', inferSchema='True', delimiter=',') \
                            .csv("data/olist_sellers_dataset.csv")
olist_sellers_dataset.show(2)

+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|            order_id|order_item_id|          product_id|           seller_id|shipping_limit_date|price|freight_value|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|00010242fe8c5a6d1...|            1|4244733e06e7ecb49...|48436dade18ac8b2b...|2017-09-19 09:45:35| 58.9|        13.29|
|00018f77f2f0320c5...|            1|e5f2d52b802189ee6...|dd7ddc04e1b6c2c61...|2017-05-03 11:05:13|239.9|        19.93|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
only showing top 2 rows

+--------------------+----------------------+-----------+------------+
|           seller_id|seller_zip_code_prefix|seller_city|seller_state|
+--------------------+----------------------+-----------+------------+
|3442f8959a84dea7e...|                 13023|   

In [40]:
# Join the order items with the sellers,
# then sum the item prices per seller (rounded to 2 decimals)
items = olist_order_items_dataset.select("seller_id","price")
sellers = olist_sellers_dataset.select("seller_id","seller_city","seller_state")
sales_df = items.join(sellers, ['seller_id'], "inner")
# Alias the aggregate directly instead of renaming the auto-generated column name
sales_df = sales_df\
    .groupBy('seller_id','seller_state','seller_city')\
    .agg(round(sum('price'), 2).alias('sales'))
sales_df.show(4)
sales_df.createOrReplaceTempView("join_table")

+--------------------+------------+--------------+--------+
|           seller_id|seller_state|   seller_city|   sales|
+--------------------+------------+--------------+--------+
|7142540dd4c91e223...|          SP|     penapolis|37373.56|
|897060da8b9a21f65...|          SP|ribeirao preto|23023.92|
|318f287a62ab7ac10...|          SP|     sao paulo| 2517.48|
|609e1a9a6c2539919...|          SC|       brusque| 6595.23|
+--------------------+------------+--------------+--------+
only showing top 4 rows
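
The cell above sums `price` only, while the task statement defines revenue as price minus freight. A minimal plain-Python sketch of that net-revenue aggregation follows. The rows, the seller ids `s1` and `s2`, and the helper `net_revenue_by_seller` are illustrative assumptions, not part of the dataset or the notebook.

```python
from collections import defaultdict

# Illustrative rows only: (seller_id, price, freight_value).
order_items = [
    ("s1", 58.90, 13.29),
    ("s1", 239.90, 19.93),
    ("s2", 100.00, 10.00),
]

def net_revenue_by_seller(rows):
    """Sum (price - freight_value) per seller, rounded to 2 decimals."""
    totals = defaultdict(float)
    for seller_id, price, freight in rows:
        totals[seller_id] += price - freight
    return {seller: round(total, 2) for seller, total in totals.items()}

revenue = net_revenue_by_seller(order_items)
print(revenue)
```

In the Spark pipeline this would presumably mean selecting `freight_value` alongside `price` and aggregating the difference of the two columns instead of `price` alone.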



### Top 2 sellers in every state
**using pyspark dataframe**

In [43]:
# Partition by state
window_state = Window.partitionBy('seller_state').orderBy(col('sales').desc())
state_df = sales_df\
    .withColumn('rank',rank().over(window_state))\
    .filter(col('rank')<=2)\
    .select('seller_id','seller_state','sales','rank')
# Show 6 results for testing
state_df.show(6)

+--------------------+------------+--------+----+
|           seller_id|seller_state|   sales|rank|
+--------------------+------------+--------+----+
|04308b1ee57b6625f...|          SC| 60130.6|   1|
|eeb6de78f79159600...|          SC|43739.84|   2|
|3364a91ec4d56c98e...|          RO| 3579.94|   1|
|a5259c149128e82c9...|          RO| 1182.26|   2|
|47efca563408aae19...|          PI|  2522.0|   1|
|327b89b872c14d1c0...|          AM|  1177.0|   1|
+--------------------+------------+--------+----+
only showing top 6 rows



**using pyspark sql**

In [46]:
df = spark.sql("""WITH seller AS (
               SELECT
               seller_id,
               seller_state,
               ROUND(SUM(sales)) AS sum_price
               FROM join_table
               GROUP BY seller_state, seller_id
               )
               SELECT seller_id, seller_state, sum_price
               FROM (SELECT *,
               ROW_NUMBER() OVER(PARTITION BY seller_state ORDER BY sum_price DESC) AS row_number
               FROM seller)
               WHERE row_number < 3
               """)
df.show(4)



+--------------------+------------+---------+
|           seller_id|seller_state|sum_price|
+--------------------+------------+---------+
|04308b1ee57b6625f...|          SC|  60131.0|
|eeb6de78f79159600...|          SC|  43740.0|
|3364a91ec4d56c98e...|          RO|   3580.0|
|a5259c149128e82c9...|          RO|   1182.0|
+--------------------+------------+---------+
only showing top 4 rows
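
The DataFrame version filters on `rank()`, while the SQL version filters on `ROW_NUMBER()`. The two agree unless sellers in the same partition tie on sales: `rank()` gives tied rows the same rank, so it can return more than two sellers per state, whereas `ROW_NUMBER()` always returns at most two. The plain-Python sketch below imitates both behaviours on made-up rows; the data and the helper `top_n` are hypothetical and only illustrate the semantics.

```python
from itertools import groupby

# Hypothetical (seller_id, state, sales) rows with a tie for second place in "SP".
sales = [
    ("a", "SP", 300.0),
    ("b", "SP", 200.0),
    ("c", "SP", 200.0),
    ("d", "SC", 150.0),
]

def top_n(rows, n=2, method="rank"):
    """Return seller ids whose rank (or row number) within their state is <= n."""
    # Sort by state, then by sales descending, like PARTITION BY ... ORDER BY ... DESC.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[1]):
        prev_sales, rank = None, 0
        for position, row in enumerate(group, start=1):
            # row_number: always the position in the partition.
            # rank: tied rows keep the rank of the first tied row.
            if method == "row_number" or row[2] != prev_sales:
                rank = position
            prev_sales = row[2]
            if rank <= n:
                result.append(row[0])
    return result

with_rank = top_n(sales, method="rank")
with_row_number = top_n(sales, method="row_number")
print(with_rank, with_row_number)
```

Note also that the SQL query rounds the sums to whole numbers before ordering, so two sellers whose totals differ by less than one unit could tie there even though they do not tie in the DataFrame version.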



                                                                                

### Top 2 sellers in every city
**using pyspark dataframe**

In [35]:
# Partition by city
window_city = Window.partitionBy('seller_city').orderBy(col('sales').desc())
city_df = sales_df\
    .withColumn('rank',rank().over(window_city))\
    .filter(col('rank')<=2)\
    .select('seller_id','seller_city','sales','rank')
# Show 10 results for testing
city_df.show(10)



+--------------------+----------------+--------+----+
|           seller_id|     seller_city|   sales|rank|
+--------------------+----------------+--------+----+
|da20530872245d6cd...|       igrejinha|  314.96|   1|
|c33847515fa6305ce...|         brusque|15519.85|   1|
|ad97a199236354e53...|         brusque| 13205.9|   2|
|2c4c47cb51acd5ea5...|        buritama|  2575.9|   1|
|f181738b150df1f37...|     carapicuiba|  5529.7|   1|
|f680f85bee2d25355...|     carapicuiba| 5183.92|   2|
|60da8bfa7eebe230b...|fernando prestes|    86.6|   1|
|527801b552d0077ff...|           garca| 17942.8|   1|
|c12b92bf1c350f3e6...|           garca| 4808.78|   2|
|e333046ce6517bd8b...|         ipaussu|  7268.0|   1|
+--------------------+----------------+--------+----+
only showing top 10 rows



**using pyspark sql**

In [48]:
df = spark.sql("""WITH seller AS (
               SELECT
               seller_id,
               seller_city,
               ROUND(SUM(sales)) AS sum_price
               FROM join_table
               GROUP BY seller_city, seller_id
               )
               SELECT seller_id, seller_city, sum_price
               FROM (SELECT *,
               ROW_NUMBER() OVER(PARTITION BY seller_city ORDER BY sum_price DESC) AS row_number
               FROM seller)
               WHERE row_number < 3
               """)
df.show(6)



+--------------------+-----------+---------+
|           seller_id|seller_city|sum_price|
+--------------------+-----------+---------+
|da20530872245d6cd...|  igrejinha|    315.0|
|c33847515fa6305ce...|    brusque|  15520.0|
|ad97a199236354e53...|    brusque|  13206.0|
|2c4c47cb51acd5ea5...|   buritama|   2576.0|
|f181738b150df1f37...|carapicuiba|   5530.0|
|f680f85bee2d25355...|carapicuiba|   5184.0|
+--------------------+-----------+---------+
only showing top 6 rows
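
Both city queries partition by `seller_city` alone. If the same city name occurs in more than one state, those sellers end up in a single partition. The plain-Python sketch below shows the difference between keying on the city name and keying on the (state, city) pair; the rows, the placeholder name `city_x`, and the helper `top_two` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (seller_id, state, city, sales).
# Two different states contain a city with the same name.
sales = [
    ("a", "SP", "city_x", 500.0),
    ("b", "SP", "city_x", 400.0),
    ("c", "PB", "city_x", 900.0),
    ("d", "PB", "city_x", 100.0),
]

def top_two(rows, key):
    """Group rows by key(row) and keep the two seller ids with the highest sales."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {
        k: [r[0] for r in sorted(g, key=lambda r: -r[3])[:2]]
        for k, g in groups.items()
    }

by_city = top_two(sales, key=lambda r: r[2])
by_state_city = top_two(sales, key=lambda r: (r[1], r[2]))
print(by_city)
print(by_state_city)
```

In Spark, the equivalent change would presumably be to partition the window by both `seller_state` and `seller_city`.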



