## Task 4
**Top 2 highest-earning sellers in each location.**<br>
> We have two tables: the sales history and the list of all sellers. Each seller is assigned to exactly one location. Our task is to find the two sellers with the highest revenue in each location. Because we know both the state and the city of every seller, we will find the top sellers for each of those two groupings. We assume that the revenue from a sale equals the price of the product. At the end we will draw a map of the whole country with markers scaled according to the number of sales and the revenue gained.
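
The task statement above can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming a hypothetical toy list of `(seller_id, location, price)` tuples; the function name `top_sellers_by_location` and the sample data are illustrative only and are not part of the notebook.

```python
from collections import defaultdict
import heapq

# Hypothetical toy data: (seller_id, location, price) for each sold item.
items = [
    ("s1", "SP", 100.0),
    ("s1", "SP", 50.0),
    ("s2", "SP", 120.0),
    ("s3", "SP", 30.0),
    ("s4", "RJ", 80.0),
    ("s5", "RJ", 90.0),
]


def top_sellers_by_location(rows, n=2):
    """Return {location: [(seller_id, revenue), ...]} with the n best sellers."""
    # Step 1: revenue per seller is the sum of the item prices.
    revenue = defaultdict(float)
    location = {}
    for seller_id, loc, price in rows:
        revenue[seller_id] += price
        location[seller_id] = loc
    # Step 2: group the sellers by their location.
    by_location = defaultdict(list)
    for seller_id, total in revenue.items():
        by_location[location[seller_id]].append((seller_id, total))
    # Step 3: keep the n sellers with the highest revenue in each location.
    return {
        loc: heapq.nlargest(n, sellers, key=lambda pair: pair[1])
        for loc, sellers in by_location.items()
    }


top = top_sellers_by_location(items)
```

The PySpark cells below follow the same three steps, with a window function taking the place of `heapq.nlargest`.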

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rank
from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F
from pyspark.sql.window import Window
import folium
from folium import plugins
from IPython.display import clear_output
import pandas as pd

spark = SparkSession.builder.getOrCreate()


In [2]:
# Reading data from parquet files and selecting only the columns we need
# Order items data
items = spark.read.parquet('/home/jovyan/work/sales_analysis/data_parquet/olist_order_items_dataset.parquet')\
    .select('order_id','seller_id','price')
# Sellers data 
sellers = spark.read.parquet('/home/jovyan/work/sales_analysis/data_parquet/olist_sellers_dataset.parquet')\
    .select('seller_id','seller_zip_code_prefix','seller_city','seller_state')\
    .withColumnRenamed('seller_zip_code_prefix','zip_code')
# Geolocation data 
geo = spark.read.parquet('/home/jovyan/work/sales_analysis/data_parquet/olist_geolocation_dataset.parquet')\
    .select('geolocation_zip_code_prefix','geolocation_lat','geolocation_lng')\
    .withColumnRenamed('geolocation_zip_code_prefix','zip_code')\
    .groupBy('zip_code')\
    .agg({'geolocation_lat':'avg','geolocation_lng':'avg'})\
    .withColumnRenamed('avg(geolocation_lat)','lat')\
    .withColumnRenamed('avg(geolocation_lng)','lng')
# Orders data 
orders = spark.read.parquet('/home/jovyan/work/sales_analysis/data_parquet/olist_orders_dataset.parquet')\
    .select('order_id','order_purchase_timestamp')

# |-- order_id: string (nullable = true)
# |-- seller_id: string (nullable = true)
# |-- price: double (nullable = true)

# |-- seller_id: string (nullable = true)
# |-- seller_zip_code_prefix: integer (nullable = true)
# |-- seller_city: string (nullable = true)
# |-- seller_state: string (nullable = true)

# |-- geolocation_zip_code_prefix: integer (nullable = true)
# |-- lat: double (nullable = true)
# |-- lng: double (nullable = true)

# |-- order_id: string (nullable = true)
# |-- order_purchase_timestamp: string (nullable = true)
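
The `geo` frame above is reduced to one averaged coordinate pair per zip-code prefix, presumably because the geolocation table holds several points per prefix and joining it unaggregated would duplicate seller rows. A minimal plain-Python sketch of that averaging step, assuming hypothetical sample points and an illustrative helper name `mean_coordinates`:

```python
from collections import defaultdict

# Hypothetical sample: (zip_code_prefix, lat, lng), several points per prefix.
points = [
    (1037, -23.50, -46.60),
    (1037, -23.54, -46.64),
    (4000, -22.90, -43.20),
]


def mean_coordinates(rows):
    """Return {zip_code_prefix: (mean_lat, mean_lng)}."""
    # Accumulate [lat_sum, lng_sum, point_count] for every prefix.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for zip_code, lat, lng in rows:
        acc = sums[zip_code]
        acc[0] += lat
        acc[1] += lng
        acc[2] += 1
    return {
        zip_code: (lat_sum / n, lng_sum / n)
        for zip_code, (lat_sum, lng_sum, n) in sums.items()
    }


centroids = mean_coordinates(points)
```

After this step every prefix maps to exactly one coordinate pair, so a join on the prefix keeps one row per seller.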

                                                                                

In [3]:
# Calculating sum of money earned by sellers 
# and joining sellers table to be able to calculate 
# money earned partitioning by location 
sales = items\
    .groupBy('seller_id')\
    .agg({'price':'sum'})\
    .withColumnRenamed('sum(price)','revenue')\
    .join(sellers,['seller_id'])

### Top 2 sellers in every state
**using pyspark dataframe**

In [4]:
# Partition by state
window_state = Window.partitionBy('seller_state').orderBy(col('revenue').desc())
state_df = sales\
    .withColumn('rank',rank().over(window_state))\
    .filter(col('rank')<=2)\
    .select('seller_id','seller_state','revenue')\
    .orderBy(col('revenue').desc())
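
One detail worth keeping in mind: `rank()` follows SQL `RANK` semantics, so sellers with exactly equal revenue share a rank and the filter `rank <= 2` can return more than two rows for a location. `row_number()` would always return at most two. The plain-Python sketch below imitates both behaviours on hypothetical revenue values; the helper names are illustrative and not PySpark functions.

```python
def rank_desc(values):
    """Imitate SQL RANK over values sorted descending: ties share a rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for position, value in enumerate(ordered, start=1):
        if position > 1 and value == ordered[position - 2]:
            # Same value as the previous row: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New value: the rank is the row position, which leaves gaps after ties.
            ranks.append(position)
    return list(zip(ordered, ranks))


def row_number_desc(values):
    """Imitate SQL ROW_NUMBER over values sorted descending: no shared numbers."""
    ordered = sorted(values, reverse=True)
    return list(zip(ordered, range(1, len(ordered) + 1)))


# Hypothetical revenues in one location, with a tie for second place.
revenues = [500.0, 300.0, 300.0, 100.0]
top2_rank = [value for value, r in rank_desc(revenues) if r <= 2]
top2_row_number = [value for value, r in row_number_desc(revenues) if r <= 2]
```

With floating-point revenue sums exact ties are unlikely, so both functions will usually give the same answer here; the choice only matters if tied sellers should all be reported.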

### Top 2 sellers in every city
**using pyspark dataframe**

In [5]:
# Partition by city
window_city = Window.partitionBy('seller_city').orderBy(col('revenue').desc())
city_df = sales\
    .withColumn('rank',rank().over(window_city))\
    .filter(col('rank')<=2)\
    .select('seller_id','seller_city','revenue')\
    .orderBy(col('revenue').desc())

In [6]:
# Showing the first 12 rows of each result: states first, then cities.
print('Top 2 sellers from every state:\n')
state_df.show(12)
print('\nTop 2 sellers from every city:\n')
city_df.show(12)

Top 2 sellers from every state:

+--------------------+------------+------------------+
|           seller_id|seller_state|           revenue|
+--------------------+------------+------------------+
|4869f7a5dfa277a7d...|          SP| 229472.6300000005|
|53243585a1d6dc264...|          BA|222776.05000000002|
|4a3ca9315b744ce9f...|          SP| 200472.9200000013|
|46dc3b2cc0980fb8e...|          RJ|128111.19000000028|
|620c87c171fb2a6dd...|          RJ|114774.50000000041|
|a1043bafd471dff53...|          MG|101901.16000000018|
|ccc4bbb5f32a6ab2b...|          PR|          74004.62|
|04308b1ee57b6625f...|          SC| 60130.59999999999|
|522620dcb18a6b31c...|          PR| 57168.48999999999|
|de722cd6dad950a92...|          PE|55426.099999999926|
|25c5c91f63607446a...|          MG| 54679.21999999999|
|eeb6de78f79159600...|          SC|43739.840000000004|
+--------------------+------------+------------------+
only showing top 12 rows


Top 2 sellers from every city:

+--------------------+----------------+------------------+
|           seller_id|     seller_city|           revenue|
+--------------------+----------------+------------------+
|4869f7a5dfa277a7d...|         guariba| 229472.6300000005|
|53243585a1d6dc264...|lauro de freitas|222776.05000000002|
|4a3ca9315b744ce9f...|        ibitinga| 200472.9200000013|
|fa1c13f2614d7b5c4...|          sumare|194042.03000000038|
|7c67e1448b00f6e96...| itaquaquecetuba|         187923.89|
|7e93a43ef30c4f03f...|         barueri|         176431.87|
|da8622b14eb17ae28...|      piracicaba|160236.57000000114|
|7a67c85e85bb2ce85...|       sao paulo|141745.53000000032|
|1025f0e2d44d7041d...|       sao paulo|138968.55000000022|
|46dc3b2cc0980fb8e...|  rio de janeiro|128111.19000000028|
|620c87c171fb2a6dd...|      petropolis|114774.50000000041|
|7d13fca1522535862...|  ribeirao preto|113628.97000000007|
+--------------------+----------------+------------------+
only showing top 12 rows



                                                                                

### Creating map with sales markers 

In [7]:
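# Aggregating per location: total revenue, row count and mean coordinates.
# `sales` holds one row per seller (assuming seller_id is unique in the sellers table),
# so `count` is the number of sellers in each location.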
salesmap_df = sales.join(geo,'zip_code','left').orderBy(col('revenue').desc())
salescity_df = salesmap_df\
    .groupby('seller_city')\
    .agg({'revenue':'sum','lat':'avg','lng':'avg','seller_state':'count'})\
    .orderBy(col('sum(revenue)'))\
    .select('seller_city',
            col('sum(revenue)').alias('revenue'),
            col('count(seller_state)').alias('count'),
            col('avg(lat)').alias('lat'),
            col('avg(lng)').alias('lng'))

salesstate_df = salesmap_df\
    .groupby('seller_state')\
    .agg({'revenue':'sum','lat':'avg','lng':'avg','seller_city':'count'})\
    .orderBy(col('sum(revenue)'))\
    .select('seller_state',
            col('sum(revenue)').alias('revenue'),
            col('count(seller_city)').alias('count'),
            col('avg(lat)').alias('lat'),
            col('avg(lng)').alias('lng'))

In [8]:
salescity_df.write.parquet("/home/jovyan/work/sales_analysis/raport/transformed_data/4_task_salescity_df.parquet",mode="overwrite")
salesstate_df.write.parquet("/home/jovyan/work/sales_analysis/raport/transformed_data/6_task_salesstate_df.parquet",mode="overwrite")


In [9]:
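# Helper functions that bucket a value into one of six groups (0 = largest, 5 = smallest).
# The group index is later used to look up a marker colour and size.
# count_group buckets counts and sum_group buckets revenue; the maps below only use count_group.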
def count_group(x):
    if x>1500:
        return 0
    elif x>200:
        return 1
    elif x>100:
        return 2
    elif x>20:
        return 3
    elif x>10:
        return 4
    else:
        return 5
def sum_group(x):
    if x>8000000:
        return 0
    elif x>1000000:
        return 1
    elif x>500000:
        return 2
    elif x>50000:
        return 3
    elif x>10000:
        return 4
    else:
        return 5
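
The two threshold ladders above could also be written as data plus one lookup. This is a minimal sketch using the standard-library `bisect` module; the names `COUNT_THRESHOLDS`, `SUM_THRESHOLDS` and `bucket` are hypothetical and are not used elsewhere in the notebook.

```python
from bisect import bisect_left

# Same cut-offs as count_group and sum_group, in ascending order.
COUNT_THRESHOLDS = [10, 20, 100, 200, 1500]
SUM_THRESHOLDS = [10000, 50000, 500000, 1000000, 8000000]


def bucket(x, thresholds):
    """Return 0 for the largest values and len(thresholds) for the smallest.

    bisect_left counts the thresholds strictly below x, which matches the
    strict `x > threshold` comparisons of the if/elif version.
    """
    return len(thresholds) - bisect_left(thresholds, x)


count_bucket_of_250 = bucket(250, COUNT_THRESHOLDS)
sum_bucket_of_60000 = bucket(60000, SUM_THRESHOLDS)
```

Keeping the cut-offs in a list makes it easier to tune them without touching the branching logic.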

In [10]:
# Row layout of the collected data frames:
# row[j] | 0 - location name, 1 - revenue, 2 - count, 3 - lat, 4 - lng
REV = 1
COUNT = 2
LAT = 3
LNG = 4

colors = ['#CE4C18','#D58321','#D5B421','#C3E839','#43A85F','#66CF83']
sizes = [25,12,8,5,3,1]


def build_sales_map(rows):
    """Draw one circle marker per row, sized and coloured by its count group."""
    # Initialize folium map
    fmap = folium.Map(zoom_start=4,location=[-23.54, -48.91], prefer_canvas=True)
    for row in rows:
        group = count_group(row[COUNT])
        folium.CircleMarker(
            location=[row[LAT],row[LNG]],
            radius=sizes[group],
            color=colors[group],
            fill=True,
            fill_color=colors[group],
            fill_opacity=1,
            popup=row[COUNT],
            tooltip=row[COUNT]
        ).add_to(fmap)
    return fmap


# In the previous file the PySpark for loop was slow because collect() was
# called in every iteration instead of once before the loop, so each data
# frame is collected exactly once here.
sales_map = build_sales_map(salesstate_df.collect())
sales_map.save('plots/state_sales_map.html')

# Build the city-level map; sales_map now refers to it in the cells below
sales_map = build_sales_map(salescity_df.collect())
sales_map.save('plots/city_sales_map.html')
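
Because `geo` is attached with a left join, a location whose zip-code prefixes are all missing from the geolocation table would end up with null coordinates, and such a row cannot be placed on the map. A minimal sketch of a guard that could run on the collected rows before plotting, assuming the same column positions as above; the helper name `rows_with_coordinates` and the sample rows are hypothetical.

```python
# Column positions in a collected row, as in the cell above.
LAT = 3
LNG = 4


def rows_with_coordinates(rows):
    """Keep only the rows that have both a latitude and a longitude."""
    return [
        row for row in rows
        if row[LAT] is not None and row[LNG] is not None
    ]


# Hypothetical rows: (location, revenue, count, lat, lng).
sample = [
    ("SP", 1000.0, 12, -23.5, -46.6),
    ("XX", 10.0, 1, None, None),
]
plottable = rows_with_coordinates(sample)
```

PySpark `Row` objects support positional indexing like tuples, so the same filter should work on the output of `collect()`.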

                                                                                

In [11]:
sales_map

# Map with markers scaled by number of sales done in the given city
<img src='sales_map.png'/>

In [12]:
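# !pip install Pillow
# !pip install selenium
# Note: _to_png renders the map in a headless Firefox through selenium,
# so geckodriver must be on PATH (its absence causes the error shown below).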
import io
from PIL import Image

img_data = sales_map._to_png(5)
img = Image.open(io.BytesIO(img_data))
img.save('plots/sales_map.png')

WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 
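
The error above means the `geckodriver` binary was not found on `PATH`. One way to avoid the exception is to check for the driver before attempting the export. The sketch below uses only the standard library; the helper name `can_export_png` is hypothetical.

```python
import shutil


def can_export_png(driver_name="geckodriver"):
    """Return True if the given WebDriver executable is found on PATH."""
    return shutil.which(driver_name) is not None


export_possible = can_export_png()
```

The PNG export cell could then run only when `export_possible` is true and otherwise fall back to the HTML files already saved under `plots/`.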
